Symbiose seminars

  • Swarm: robust and fast clustering method for amplicon-based studies

    Frédéric Mahé (Department of Ecology, Technische Universität Kaiserslautern)
    Thursday, September 4, 2014 - 10:30
    Room Minquiers
    Popular de novo amplicon clustering methods suffer from two fundamental flaws: arbitrary global clustering thresholds, and input-order dependency induced by centroid selection. Swarm was developed to address these issues by first clustering nearly identical amplicons iteratively using a local threshold, and then by using clusters' internal structure and amplicon abundances to refine its results. This fast, scalable, and input-order independent approach reduces the influence of clustering parameters and produces robust operational taxonomic units, improving the amount of meaningful biological information that can be extracted from amplicon-based studies.
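
    The first phase described above (growing clusters through a local threshold rather than a global one) can be sketched as follows. This is a minimal illustration, not the Swarm implementation: the function names and the input format are hypothetical, a plain mismatch count on equal-length sequences stands in for a real edit distance, and the abundance-based refinement phase is left out.

```python
from collections import deque


def differences(a, b):
    """Count mismatches; sequences of unequal length are treated as far apart.

    Assumption: a simple mismatch count replaces a true edit distance.
    """
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def grow_clusters(abundances, d=1):
    """Group amplicons by iteratively linking sequences within d differences.

    abundances: dict mapping sequence -> read count (hypothetical format).
    Returns a list of clusters, each a list of sequences with the seed first.
    """
    # Seeds are visited by decreasing abundance (ties broken by sequence),
    # so the result does not depend on the order of the input.
    ordered = sorted(abundances, key=lambda s: (-abundances[s], s))
    unassigned = set(ordered)
    clusters = []
    for seed in ordered:
        if seed not in unassigned:
            continue
        unassigned.remove(seed)
        cluster = [seed]
        queue = deque([seed])
        while queue:
            current = queue.popleft()
            # The threshold is local: it is applied to each member in turn,
            # never to the distance from the seed.
            neighbours = sorted(
                s for s in unassigned if differences(current, s) <= d
            )
            for s in neighbours:
                unassigned.remove(s)
                cluster.append(s)
                queue.append(s)
        clusters.append(cluster)
    return clusters


example = {"ACGT": 10, "ACGA": 3, "ACTA": 1, "TTTT": 5, "TTTA": 2}
print(grow_clusters(example))
# [['ACGT', 'ACGA', 'ACTA'], ['TTTT', 'TTTA']]
```

    In this toy input, ACTA differs from the seed ACGT at two positions but still joins its cluster through ACGA, which shows how a local threshold lets a cluster follow a chain of near-identical amplicons.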

  • Arthropod Genome Sequencing at the Baylor College of Medicine Human Genome Sequencing Center

    Stephen Richards (Baylor College of Medicine Human Genome Sequencing Center)
    Thursday, June 26, 2014 - 10:30
    Room Aurigny
    We have long pioneered the sequencing of insect genomes, from Drosophila melanogaster to aphids, beetles and centipedes. As decreasing sequencing costs have allowed, we are expanding our investigations to the whole arthropod phylum. As a pilot for the insect 5,000 genomes project, we are sequencing 30 arthropod genomes to identify practical issues and solutions for the selection, DNA isolation, sequencing, assembly, annotation, analysis and publication of multiple arthropod genomes. Here we describe examples demonstrating the power of the de novo genome to drive biology, and the successes, problems and lessons learned so far from our pilot project. We also present the automated annotation pipeline used for the project. We hope that this project will inform larger projects in the future.

  • Deciphering respective genome wide roles of bacteria within a community responsible for copper bioleaching metabolic processes: an integrative systems ecology approach

    Philippe Bordon (Univ. of Chile)
    Thursday, May 22, 2014 - 10:30
    Room Aurigny
    Bioleaching is the extraction of metals from ores through the cooperative participation of several extremophile microorganisms. Because of its great industrial interest, many studies have focused on identifying the isolated contributions of single strains to the process. Even though these studies achieved important advances, the functioning of a bioleaching consortium as a whole remains far from understood. From a holistic perspective, this presentation proposes a novel integrative systems ecology approach that aims to give a functional sense to a metagenomic consortium by integrating genomic and metabolic knowledge at genome scale. Using public genome data of five bacterial strains involved in copper bioleaching (Acidiphilium cryptum, Acidithiobacillus ferrooxidans, Acidithiobacillus thiooxidans, Leptospirillum ferriphilum and Sulfobacillus thermosulfidooxidans), we first reconstructed a global integrative metabolic network. Next, using a parsimony assumption, we deciphered sets of genes, called SGS, that take an active part in metabolic pathways related to bioleaching and are consecutive on their respective genomes, with the added constraint that the associated metabolic reactions are also closely connected within the metabolic networks. Finally, SGS analysis showed that no segment is shared by all five bacteria, suggesting that no single organism can carry out copper bioleaching alone. It also pinpointed the combination of bacterial interactions needed to promote these pathways, as well as the major hub role of A. cryptum. Overall, the SGS paradigm depicts genomic functional units and their respective roles in maintaining metabolic pathways, information that is crucial for genetically monitoring bacterial participation as a whole in environmental processes.

  • Enhancing reuse in scientific workflows

    Sarah Cohen-Boulakia (LRI, Université Paris-Sud)
    Thursday, May 15, 2014 - 10:45
    Room Aurigny
    Scientific workflows have been introduced to enhance the reproducibility, sharing and reuse of in-silico experiments. Their simple programming model appeals to bioinformaticians, who can use them to specify complex data processing pipelines.

    In this talk, I will first present the results of a study we performed on workflow (re)use, based on a large set of public scientific workflows: while the number of available scientific workflows is increasing along with their popularity, workflows are not (re)used and shared as much as they could be.

    I will then present several projects that aim to enhance workflow reuse, focusing more specifically on the recent DistillFlow project. DistillFlow reduces the structural complexity of workflows to make them easier for users to understand. The refactoring approach followed in DistillFlow has given very interesting results both on the 1,500 public workflows from and on the more curated workflow sets from the BioVel project (workflows to analyze biodiversity data).

  • Inferring metabolic pathways in non-model species: from genomics to metabolomics

    Gabriel Markov (Tuebingen)
    Tuesday, April 15, 2014 - 10:30
    Room Aurigny
    Currently, to find out whether a known metabolic pathway is present in a non-model species, bioinformaticians focus on searching for orthologous enzymes in the closest model species. The presence of a few orthologous enzymes is often taken as sufficient evidence that the pathway of interest is conserved, but this shortcut is not always justified. What information does comparative genomics provide about the conservation of metabolic pathways, and why is metabolomics an indispensable complement for the high-throughput study of metabolic diversity in non-model species?

  • Predicting the folding nucleus of globular proteins

    Jacques Chomilier (BiBiP, IMPMC, Université Pierre et Marie Curie, Paris)
    Thursday, April 10, 2014 - 10:30
    Room Aurigny
    Several models exist to describe protein folding, that is, the formation of a compact globule after the peptide chain has been synthesized in the ribosome. Among them, the nucleation-condensation model states that, under thermal agitation, backbone fluctuations bring into contact amino acids that are spread along the sequence. These amino acids then form the folding nucleus, and we are interested in predicting them from the sequence by simulating folding in a discrete space with a Monte Carlo technique. We call MIR (Most Interacting Residues) the positions occupied by amino acids involved in a large number of non-covalent contacts. Their comparison with experimental data will be presented.

  • Formalizing signaling networks in logic

    Christine Froideveaux (LRI - INRIA AMIB - Université Paris Sud)
    Thursday, March 27, 2014 - 10:30
    Room Aurigny
    In the first part of the talk, we will present a method based on domain knowledge that builds the topology of molecular networks by exploiting experimental data and general reasoning rules provided by experts. We will show how this method, applied to signaling networks, makes it possible to discover new relations in the FSH network. In the second part, we will introduce a translation of the standard language Systems Biology Graphical Notation Activity Flow (SBGN-AF) into logic programming. We will show how this translation can be used to analyze the dynamics of SBGN-AF networks.

  • Operator-valued kernels for network inference

    Florence d'Alché-Buc (Université d’Evry-Val d’Essonne)
    Thursday, March 20, 2014 - 10:30
    Room Aurigny
    Reverse engineering of gene regulatory networks remains a central challenge in computational systems biology, despite recent advances facilitated by benchmark in-silico challenges that have helped calibrate the performance of inference methods. A number of approaches using either perturbation (knock-out) or wild-type time series data have appeared in the literature to address this problem, with the latter employing linear temporal models. Nonlinear dynamical models are particularly appropriate for this inference task given the mechanism that generates the time series data. In this study, we introduce a novel nonlinear autoregressive model based on operator-valued kernels that simultaneously learns the model parameters and the network structure. Like all kernel-based methods, this new model benefits from the regularization framework and great flexibility. The empirical estimate of the model's Jacobian matrix provides an estimate of the network structure. We propose a new learning method based on boosting. The performance of the proposed algorithm is evaluated on a number of benchmark data sets from the DREAM3 challenge and then on real datasets related to the IRMA and T-cell networks.

  • A framework based on probabilistic context-free grammars and a genetic algorithm for analysis of protein sequences

    Witold Dyrka (Inria Bordeaux)
    Thursday, February 27, 2014 - 10:30
    Room Aurigny
    Hidden Markov Models power many state-of-the-art tools in the field of protein bioinformatics. While excelling at their tasks, these methods of protein analysis do not directly convey information on medium- and long-range residue-residue interactions. Capturing such interactions requires the expressive power of at least context-free grammars. However, the application of more powerful grammar formalisms to protein analysis has been surprisingly limited. To address this problem, we have developed a probabilistic grammatical framework for problem-specific protein languages. The core of the model consists of a probabilistic context-free grammar (PCFG), automatically inferred by a genetic algorithm from only a generic set of expert-based rules and positive training sequences represented by physico-chemical properties. We tested the PCFG framework on the detection of ligand binding sites [1] and the classification of helix-helix contact sites, where it outperformed the state of the art [2]. Recently, we used the model to distinguish between amyloidogenic and non-amyloidogenic protein fragments and achieved good results (AUROC up to 0.80). A significant feature of the PCFG approach is the explanatory power of grammar rules and parse trees, which could provide biologically meaningful information. This is joint work with Jean-Christophe Nebel, Malgorzata Kotulska and Florence Thirion.
    [1] Dyrka and Nebel. BMC Bioinformatics 2009, 10:323
    [2] Dyrka et al. Algorithms for Molecular Biology 2013, 8:31
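
    To make the scoring side of a PCFG concrete, the sketch below computes the probability that a grammar in Chomsky normal form generates a sequence, using the standard inside (probabilistic CYK) algorithm. It is only an illustration under stated assumptions: the toy grammar, the two-letter alphabet ('h' for hydrophobic, 'p' for polar) and all names are hypothetical, and the genetic-algorithm inference of the grammar described in the abstract is not shown.

```python
from collections import defaultdict


def inside_probability(sequence, lexical, binary, start="S"):
    """Probability that `start` derives `sequence` under a CNF grammar.

    lexical: {(A, terminal): p} for rules A -> terminal
    binary:  {(A, B, C): p}     for rules A -> B C
    """
    n = len(sequence)
    if n == 0:
        return 0.0
    # chart[i][j][A] = probability that A derives sequence[i..j] (inclusive)
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(sequence):
        for (a, terminal), p in lexical.items():
            if terminal == symbol:
                chart[i][i][a] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point: left part ends at k
                for (a, b, c), p in binary.items():
                    left = chart[i][k].get(b, 0.0)
                    right = chart[k + 1][j].get(c, 0.0)
                    if left and right:
                        chart[i][j][a] += p * left * right
    return chart[0][n - 1].get(start, 0.0)


# Hypothetical toy grammar over physico-chemical classes.
toy_lexical = {("H", "h"): 1.0, ("P", "p"): 1.0}
toy_binary = {("S", "H", "P"): 0.5, ("S", "S", "S"): 0.5}

print(inside_probability("hp", toy_lexical, toy_binary))    # 0.5
print(inside_probability("hphp", toy_lexical, toy_binary))  # 0.125
print(inside_probability("hh", toy_lexical, toy_binary))    # 0.0
```

    Because the chart combines non-adjacent spans through the binary rules, such a grammar can relate residues that are far apart in the sequence, which is the kind of dependency the abstract says Hidden Markov Models do not convey directly.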

  • Beyond N-gram modelling of documents

    Matthias Gallé (Xerox Grenoble)
    Thursday, February 6, 2014 - 10:30
    Room Aurigny
    The traditional way of modeling textual documents for text analytics is the bag-of-words or bag-of-n-grams approach. Despite the good performance of this lossy representation in machine learning applications, it has some well-known shortcomings due to the assumption that each n-gram is independent. We propose an alternative representation based on repeated substrings of unbounded length (infinity-grams). In this talk we will show some applications, show how to overcome some computational challenges, and concentrate on the problem of recovering bigger chunks of text when the only available information is the n-grams.
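
    The two representations contrasted above can be illustrated with a short sketch. The function names are hypothetical, and the repeated-substring search is a naive stand-in that grows n until nothing repeats; it is not the method presented in the talk, which would need more efficient data structures to scale.

```python
from collections import Counter


def bag_of_ngrams(tokens, n):
    """Count every run of n consecutive tokens; word order beyond n is lost."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def repeated_ngrams(tokens, min_count=2):
    """Return all token sequences, of any length, seen at least min_count times.

    If no sequence of length n repeats, none of length n + 1 can, so the
    search stops at the first length with no repeat.
    """
    repeats = {}
    n = 1
    while True:
        found = {
            gram: count
            for gram, count in bag_of_ngrams(tokens, n).items()
            if count >= min_count
        }
        if not found:
            return repeats
        repeats.update(found)
        n += 1


# Two different sentences share the same bag of unigrams.
a = "dog bites man".split()
b = "man bites dog".split()
print(bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1))  # True

text = "the cat sat on the mat the cat ran".split()
print(repeated_ngrams(text))
# {('the',): 3, ('cat',): 2, ('the', 'cat'): 2}
```

    The first comparison shows the information lost by the bag model, while the second keeps the repeated phrase "the cat" as a single unit without fixing n in advance.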