Symbiose seminars

  • TBA

    Aurelien Naldi (ENS)
    Thursday, January 9, 2020 - 10:30 to 12:00
    Room Aurigny
    Talk abstract: 

    TBA

  • TBA

    Marine Jacquier (IGDR)
    Thursday, December 12, 2019 - 10:30 to 12:00
    Room Aurigny
    Talk abstract: 

    TBA

  • TBA

    Sylvain Glémin (EcoBio)
    Thursday, November 21, 2019 - 10:30 to 12:00
    Room Aurigny
    Talk abstract: 

    TBA

  • Learning clinical networks from medical records based on information estimates in mixed-type data

    Hervé Isambert (Institut Curie)
    Thursday, October 10, 2019 - 10:30
    Room Aurigny
    Talk abstract: 

    Network reconstruction aims at disentangling direct from indirect dependences in information-rich data and has become ubiquitous in the analysis of the rapidly expanding resources of genomic and clinical data. However, direct and indirect interdependences in mixed-type (continuous / categorical) clinical data are notoriously difficult to assess. To this end, we developed and implemented an efficient computational approach to simultaneously compute and assess the significance of multivariate information between any combination of mixed-type variables. The method is then used to uncover direct, indirect and possibly causal relationships between mixed-type data from medical records, by extending a recent machine learning method to reconstruct graphical models beyond simple categorical datasets. The method is shown to outperform existing tools on benchmark mixed-type datasets, before being applied to analyze the medical records of elderly patients with cognitive disorders from La Pitié-Salpêtrière Hospital, Paris, and breast cancer patients from Institut Curie hospitals.

  • TBA

    Patrick Dabert (IRSTEA)
    Thursday, October 3, 2019 - 10:30
    Room Aurigny
    Talk abstract: 

    TBA

  • Depicting microbial genomic diversity via a Partitioned Pangenome Graph

    Guillaume Gautreau (Genoscope)
    Thursday, September 26, 2019 - 10:30
    Room Aurigny
    Talk abstract: 

    Thanks to the surge of newly sequenced genomes, genomics studies in microbiology now frequently rely on the comparison of hundreds to thousands of genomes of a single species. A consensus representation of multiple genomes would provide a better analytical framework than using individual reference genomes. This leads to a paradigm shift from the usual linear representation of reference genomes to a pangenome graph representation bringing together all the different known variations as multiple alternative paths. Classical pangenomic approaches (Medini et al. 2005, Tettelin et al. 2005) use isolated sets of gene families partitioned into the core genome (genes present in all the genomes of a species) or the accessory genome (genes present in at least one genome of a species). Inspired by the methods released in the last few years, we propose to update Tettelin's insights by organizing gene families in a pangenome graph to depict the microbial diversity. Some approaches have been developed to factorize pangenomes at the sequence level only (reviewed in Marschall et al. 2016). However, these approaches lack direct information about genes, which complicates functional analyses based on the graph. The method introduced here, named PPanGGOLiN, can be considered as the missing link between the usual pangenomics approach (set of isolated gene families) and the pangenome graph at the sequence level.

    In current pangenomics approaches, core genes are most often defined as the set of ubiquitous genes in a clade. However, this definition has 2 major flaws: it is not robust against poorly sampled data because it is highly reliant on the presence/absence of genes in a single genome; and it misses many core genes because of the high probability of losing at least one of the core genes due to sequencing, assembly or annotation artifacts. As a consequence, the core genome obtained from a large set of genomes can be very small, requiring a relaxed definition of a core genome (generally using a fixed threshold of presence equal to 95% of the genomes). Unlike the few statistical approaches available to estimate a relaxed core genome without fixing an arbitrary threshold, PPanGGOLiN does not rely on the frequencies of gene family presence but uses the patterns of presence/absence and the pangenome graph to make the partitioning. This original approach is able to discriminate 2 sets of genes having the same frequencies of presence albeit coming from 2 different subsets of genomes. Moreover, the usual dichotomy between core and accessory genomes does not faithfully report the diverse ranges of gene frequencies in a pangenome. Therefore, as proposed by Koonin et al. 2008 and formally modeled by Collins et al. 2012, the pangenome can be split into 3 groups. This choice helps to shed light on genes potentially associated with positive environmental adaptations while avoiding confusion with potentially randomly acquired ones. For that purpose, based on the patterns of presence/absence and the pangenome graph, PPanGGOLiN divides the pangenome into (1) persistent genome, equivalent to a relaxed core genome (genes conserved in almost all genomes); (2) shell genome, moderately conserved genes potentially associated with environmental adaptation capabilities; (3) cloud genome, rare genes.

    Based on this partitioned pangenome representation, we can annotate nodes in the graph to highlight alternative paths and associate relevant metadata to them. In a way, drawing genomes on rails like a subway map may help biologists to browse the pangenome and compare their genomes of interest to the overall pangenomic diversity.

  • bistro: a library to build large-scale workflows in computational biology

    Philippe Veber (LBBE)
    Thursday, June 13, 2019 - 10:30 to 11:00
    Room Aurigny
    Talk abstract: 

    Computational pipelines for analyzing high-throughput genomics datasets typically consist of tens to hundreds of shell commands, generating thousands of files and running for days or weeks. While becoming rather complex pieces of software, they are most of the time still programmed using rudimentary tools like shell scripts, which offer very little help to develop large and reusable programs. In addition to being error-prone, implementing computational pipelines using shell scripts leaves lots of tedious aspects to the programmer, diverting her/his attention from data analysis considerations. In this work, I propose to leverage a modern, statically typed programming language to implement, as a simple library, a comfortable environment to develop bioinformatics pipelines. This library is named bistro and is written in the OCaml language. Among other features, it provides dependency tracking, parallel execution, resume-on-failure, automatic naming of intermediate files, and easy deployment of pipelines using Docker or Singularity for enhanced reproducibility. Thanks to the compiler's type checker, errors on file formats or typos in command arguments are detected at compile-time, that is, even before running the pipeline. I'll show various benefits of embedding a pipeline development framework in a general-purpose language. Among other things, it becomes very easy to integrate a pipeline into a web server, or write extensible libraries of highly configurable pipelines.

  • From QC to isoform characterization: Evaluation and improvements of Nanopore sequencing in a RNASeq context

    Sophie Lemoine (IBENS)
    Thursday, June 6, 2019 - 10:30 to 11:00
    Room Aurigny
    Talk abstract: 

    Transcript identification is a real challenge with short read sequencing. With Oxford Nanopore Technologies (ONT), our aim is to sequence full-length cDNA to directly access isoforms. We have successfully validated the analysis of differentially expressed genes on a mouse model of myelination blockage following the standard ONT protocol. The mean length of our reads was 1.2kb, which is lower than the estimated 2kb mean length of the transcripts and even worse if we consider the TSL1 tagged transcripts (2.6kb). To improve our results, we combined SmartSeq and ONT technologies to synthesize full-length cDNA from total RNA. The cDNAs were barcoded in order to sequence multiple samples on a single MinION run and allow differential expression analyses. The SmartSeq/ONT protocol allowed us to sequence much longer cDNAs. The mean length of the reads was then about 2.6kb, and the small reads that made up the majority of the population with the ONT protocol were eliminated. We were able to detect more differentially expressed targets, and the targets detected were longer than those found with the ONT protocol. The optimized protocol achieved better overall 5’-3’ transcript coverage, not surprisingly for transcripts longer than 2kb. Although it does not guarantee full-length cDNAs, it can be reliable for cDNA sequencing and improve isoform annotation and quantification using dedicated pipelines, such as FLAIR or Pinfish.

    The goal of my talk is to give an idea of:
    - the evolution of the protocols tested and improved;
    - the developments we had to perform to make the QC of our runs;
    - the ongoing evaluation of FLAIR and Pinfish in our context.

  • Genomic approaches to studying the evolution of sex determination systems in fish

    Yann Guiguen (INRA IPGP)
    Thursday, May 16, 2019 - 10:30 to 12:00
    Room Aurigny
    Talk abstract: 

    Fish display a wide variety of sex determination mechanisms, ranging from purely genetic systems to systems determined entirely or partly by the environment (temperature, density ...). Curiously, this variability follows no obvious phylogenetic pattern, with rapid transitions among closely related species, and even among different populations of the same species. To better understand this diversity and the mechanisms governing the evolution of sex chromosomes, we applied partial (RAD-Sequencing) or whole (Pool-Sequencing) genome sequencing approaches to a large number of fish species in order to characterize their sex determination systems, delimit the chromosomal regions of the sex loci, and identify candidate genes as master sex determinants. These strategies led to the identification of the type of sex determination in many species with simple monofactorial systems (XX/XY or ZZ/ZW), but also in species with more complex sex determination systems. These results also allowed us to identify new master sex-determining genes and to show that these are often "recruited" from a relatively small number of signalling pathways.

  • From alignment-free heuristics to an interactive visualization: V(D)J repertoire analysis in the Vidjil platform

    Mikaël Salson (CRIStAL U. Lille)
    Thursday, April 25, 2019 - 10:30 to 12:00
    Room Aurigny
    Talk abstract: 

    The diversity of the immune repertoire is grounded on V(D)J recombinations. Many algorithms and software tools identify these recombinations inside high-throughput sequencing data. We introduce new Aho-Corasick based heuristics to speed up the detection of V(D)J sequences in high-throughput sequencing data. We also show how those heuristics can speed up the identification of V(D)J recombinations. Our experiments show that those new heuristics improve time and space consumption of our previous algorithm, Vidjil-algo, while keeping its sensitivity and specificity. Such improvements are of importance when dozens of samples are to be analysed, as is commonly the case in a clinical setting. In such a case, users launch their analyses and interpret their results through a web application we have designed for this purpose.
