Symbiose seminars

  • Functional Analysis and Comparison of Gene Sets: Benefits of Using Semantic Similarity for Clustering DAVID Results

    Olivier Dameron (INSERM, Rennes 1) & Frédéric Hérault (INRA, UMR PEGASE)
    Thursday, October 11, 2012 - 10:30
    Room Aurigny
    Talk abstract: 

    Functional analysis of a set of genes consists in identifying its underlying biological features and is a challenging task. DAVID generates a list of enriched Gene Ontology (GO) terms and groups similar annotations into clusters ranked according to their enrichment score. However, it has two limitations: it produces many clusters of GO terms, and it ignores the semantic relations that Gene Ontology defines between these terms. We hypothesize that leveraging the semantics of Gene Ontology addresses both the quantity and the redundancy problems of DAVID and improves the functional analysis and comparison of sets of genes. We propose to compute the semantic similarity of the clusters of GO terms returned by DAVID and to use it to group similar clusters. We applied this approach to two sets of genes from a porcine muscular transcriptome study. For the analysis of each gene set, it reduced the number of clusters from twelve to four and from seventeen to five “super clusters”, respectively. These “super clusters” correspond to four biologically relevant processes in each set. To compare the sets of genes, our approach successfully identified three similar functions shared by the two sets, as well as one significant function related to nucleic acid metabolic process that was specific to the second set. These results show that post-processing DAVID results using semantic similarity-based hierarchical clustering is relevant for the functional analysis and comparison of large sets of genes.
    Keywords: functional gene analysis, Gene Ontology, semantic similarity, hierarchical clustering.
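
As a rough illustration of the post-processing idea in this abstract, the sketch below groups clusters by average-linkage agglomerative clustering over a pairwise similarity matrix. It is not the speakers' implementation: the function name `average_linkage`, the threshold, and the values in `SIM` are hypothetical placeholders, and the computation of real semantic similarities between clusters of GO terms is assumed to happen elsewhere.

```python
from itertools import combinations


def average_linkage(sim, threshold):
    """Merge the most similar groups until no pair reaches `threshold`.

    `sim` is a symmetric matrix of pairwise similarities between clusters
    (hypothetical input, e.g. a semantic similarity in [0, 1]).
    """
    groups = [[i] for i in range(len(sim))]

    def group_sim(a, b):
        # Average-linkage: mean similarity over all cross-group pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(groups) > 1:
        (x, y), best = max(
            (((a, b), group_sim(groups[a], groups[b]))
             for a, b in combinations(range(len(groups)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining groups are too dissimilar to merge
        merged = groups[x] + groups[y]
        groups = [g for k, g in enumerate(groups) if k not in (x, y)] + [merged]
    return sorted(sorted(g) for g in groups)


# Toy similarity matrix for five clusters (illustrative values only).
SIM = [
    [1.0, 0.8, 0.1, 0.2, 0.1],
    [0.8, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.7, 0.1],
    [0.2, 0.1, 0.7, 1.0, 0.2],
    [0.1, 0.1, 0.1, 0.2, 1.0],
]
super_clusters = average_linkage(SIM, threshold=0.5)
print(super_clusters)
```

With these toy values, the five input clusters collapse into three groups: clusters 0 and 1, clusters 2 and 3, and cluster 4 on its own.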

  • A combinatorial and integrated method to analyse RNA-seq reads

    Nicolas Philippe (Équipe MAB, Lirmm, Montpellier)
    Thursday, September 20, 2012 - 10:30
    Room Aurigny
    Talk abstract: 

    RNA sequencing enables a complete investigation covering the full dynamic spectrum of a transcriptome. It thus paves the way to a better understanding of the function of gene expression in different tissues, during development or in pathological states. However, the splicing process, which generates both co-linear and non-co-linear RNAs, together with sequencing errors, somatic mutations, polymorphisms, and rearrangements, makes the reads differ from the reference genome in a variety of ways. This complicates the task of comparing reads with a genome. Currently, the analysis paradigm consists in:

    1. mapping the reads contiguously to a reference genome, allowing as many differences as one expects to be necessary to accommodate sequencing errors and small polymorphisms;
    2. using uniquely mapped reads to determine covered genomic regions, either for computing a local coverage to predict mutations and filter out sequencing errors (cf. program ERANGE), or for delimiting expressed exons approximately (cf. program TopHat);
    3. re-aligning unmapped reads, which were not mapped contiguously at step one, to reveal splicing junctions.

    Limitations of this approach include lack of precision, redundant computations due to multiple mapping steps, error propagation due to heuristics, and the absence of back-tracking. We propose a novel, integrated approach to analyze today's longer reads (> 50 bp). The idea is to adopt a k-mer approach that combines genomic positions and local coverage to perform a complex analysis of each read and to detect, in a single step, mutations, indels, and errors, as well as both normal and chimeric splice junctions. Comparisons with other tools demonstrate the feasibility of this approach, which yields both sensitive and highly specific inferences.
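
To make the k-mer location idea concrete, here is a minimal sketch that indexes the k-mers of a reference and reports where each k-mer of a read occurs. It is a toy under stated assumptions, not the speaker's tool: the names `index_kmers` and `locate_read`, the sequences, and the value of `K` are all made up for illustration, and the local coverage (read support) component mentioned in the abstract is left out.

```python
from collections import defaultdict


def index_kmers(genome, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].append(pos)
    return index


def locate_read(read, index, k):
    """For each k-mer of the read, list its positions in the reference."""
    return [index.get(read[i:i + k], []) for i in range(len(read) - k + 1)]


# Hypothetical toy data: READ is GENOME[2:14] with one substitution
# at read position 6 (G replaced by T).
GENOME = "ACGTACCTGAAGTCCATGGA"
READ = "GTACCTTAAGTC"
K = 4

profile = locate_read(READ, index_kmers(GENOME, K), K)
unlocated = [i for i, hits in enumerate(profile) if not hits]
print(profile)
print(unlocated)
```

In this toy case exactly `K` consecutive k-mers (read offsets 3 to 6) cannot be located, and the located k-mers on both sides lie on the same diagonal (genome position minus read offset equals 2). That pattern is what a single substitution at read position 6 produces; an indel would shift the diagonal, and a splice junction would shift it by a much larger amount.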

  • High-Throughput Transcriptomics

    Micha Sammeth (Centre Nacional d'Anàlisi Genòmica (CNAG), Barcelona)
    Thursday, June 14, 2012 - 11:30
    Room Aurigny
    Talk abstract: 

    In this seminar I will provide an introduction to the fascinating technique called RNA-Seq. After a brief historical overview of the complementary techniques used earlier, we will review the elementary preprocessing steps of these experiments. I will then outline several applications of RNA-Seq and summarize some possible ways to analyze the data while avoiding known pitfalls.

  • Towards a functional model of a bacterial community in arsenic-polluted marine sediments

    Frédéric Plewniak (G.M.G.M - UdS/CNRS UMR7156 Strasbourg)
    Thursday, May 31, 2012 - 10:30
    Room Aurigny
    Talk abstract: 

    Arsenic, which causes major water pollution in industrial and post-industrial areas worldwide, poses serious health risks to local populations. Environmental genomics techniques now make it possible to study the adaptive and cooperative strategies of microbial communities in polluted environments. We sequenced metagenomes from harbor sediments at l'Estaque, close to a former metallurgical site highly polluted by arsenic near Marseille, and at St Mandrier, near Toulon. Analysis of the resulting sequences with the RAMMCAP protocol yielded the functional and taxonomic profiles of the two metagenomes and of four control metagenomes available in public databases. Biodiversity is higher in the two sediment communities than in those of the control sites, which are dominated at more than 80% by two orders. The order Desulfobacterales accounts for 54.7% at l'Estaque and 31.7% at St Mandrier, with all other orders present being relatively evenly distributed. However, microbial diversity is slightly higher at St Mandrier than at the highly polluted l'Estaque site. The Gene Ontology (GO) sets describing the functional profiles were compared in order to highlight the categories over-represented in the two metagenomes of interest relative to the four controls. Categories related to arsenic resistance and to oxidative stress responses are thus over-represented at l'Estaque. In addition, the metagenomic data and the physico-chemical measurements of environmental parameters allowed us to propose a descriptive model of how the prokaryotic communities function, which highlights the importance of the sulfur cycle in arsenic detoxification in relation to the presence of sulfate-reducing bacteria.

  • Fast and Accurate RNA-Seq read alignments with PALMapper

    Géraldine Jean (LINA, Université de Nantes)
    Thursday, May 10, 2012 - 10:30
    Room Minquiers
    Talk abstract: 

    High-throughput sequencing of mRNA enhances transcriptome analysis and offers great opportunities for the discovery of new genes and the identification of alternative transcripts. However, the sheer amount of high-throughput sequencing data requires efficient methods for accurate spliced alignment of reads against the reference genome, a task made harder by the limited length and quality of the sequence reads. In this talk, I will present an original RNA-Seq read mapper, called PALMapper, that combines a faster extension of the highly accurate alignment method QPALMA with the fast short-read aligner GenomeMapper. PALMapper quickly carries out an initial read mapping, which then guides a banded semi-global alignment algorithm that allows for long gaps corresponding to introns. It computes both spliced and unspliced alignments with high accuracy by taking advantage of base quality information and computational splice site predictions, brought together in an extended alignment scoring model.
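
The banding idea mentioned in this abstract can be sketched with a much simpler relative of the algorithm it describes: an edit distance computed only for cells within `band` of the main diagonal. This is not PALMapper's algorithm. The function name `banded_edit_distance` and the unit costs are assumptions, the alignment is global rather than semi-global, and there is no intron gap model, quality-aware scoring, or splice site prediction.

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b, restricted to cells with |i - j| <= band.

    Returns float('inf') when no alignment fits inside the band.
    """
    inf = float("inf")
    # Row 0: aligning the empty prefix of `a` with b[:j] costs j insertions.
    prev = {j: j for j in range(min(len(b), band) + 1)}
    for i in range(1, len(a) + 1):
        cur = {}
        for j in range(max(0, i - band), min(len(b), i + band) + 1):
            if j == 0:
                cur[j] = i  # a[:i] against the empty prefix: i deletions
                continue
            cur[j] = min(
                prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),  # (mis)match
                prev.get(j, inf) + 1,                           # gap in b
                cur.get(j - 1, inf) + 1,                        # gap in a
            )
        prev = cur
    return prev.get(len(b), inf)


print(banded_edit_distance("GATTACA", "GATCACA", band=1))
print(banded_edit_distance("ACGT", "AGT", band=1))
```

Restricting the computation to a band reduces the work from the full product of the two lengths to roughly the sequence length times the band width, which is why an initial mapping that tells the aligner where to look is useful.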

  • LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure

    Nicolas Bonnel (Université Bretagne Sud)
    Thursday, May 3, 2012 - 10:30
    Room Aurigny
    Talk abstract: 

    Over the last two decades, many protein 3D shapes have been discovered, characterized and made available thanks to the Protein Data Bank (PDB), which is growing very quickly. New scalable methods are thus urgently required to search through the PDB efficiently. We present in this paper an approach entitled LNA (Laplacian Norm Alignment) that performs structural comparison of two proteins with dynamic programming algorithms. This is achieved by characterizing each residue in the protein with scalar features. The feature values are calculated using a Laplacian operator applied on the graph corresponding to the adjacency matrix of the residues. The weighted Laplacian operator we use estimates, at various scales, local deformations of the topology where each residue is located. On some benchmarks widely shared by the community, we obtain qualitatively similar results compared to other competing approaches, but with an algorithm that is one or two orders of magnitude faster. 180,000 protein comparisons can be done within 1 second with a single recent GPU, which makes our algorithm very scalable and suitable for real-time database querying across the Web.
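
One way to picture a per-residue scalar feature derived from a graph Laplacian is sketched below. It rests on assumptions that the abstract does not confirm: residues are taken as 3D points, two residues are adjacent when they lie within a distance `cutoff`, and the feature is the norm of the unweighted Laplacian applied to the coordinates. The function name `laplacian_norms` is hypothetical, and the weighting and multi-scale aspects described in the abstract are not modeled.

```python
import math


def laplacian_norms(coords, cutoff):
    """Norm of (L x)_i for each point i, with L = D - A on the contact graph.

    For point i this is the length of the sum of (x_i - x_j) over all
    neighbours j within `cutoff` (unweighted adjacency, single scale).
    """
    features = []
    for i, p in enumerate(coords):
        acc = [0.0, 0.0, 0.0]
        for j, q in enumerate(coords):
            if i != j and math.dist(p, q) <= cutoff:
                for d in range(3):
                    acc[d] += p[d] - q[d]
        features.append(math.hypot(*acc))
    return features


# Three points on a straight line (toy coordinates, not real residues).
line = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
features = laplacian_norms(line, cutoff=1.5)
print(features)
```

Because the value is unchanged when the whole structure is rotated or translated, two proteins can be compared through their feature sequences without first superimposing them, which is what makes a plain dynamic programming alignment over such features possible.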

  • Human metagenomics: clinical impacts

    Nicolas Pons (INRA Jouy en Josas)
    Thursday, April 26, 2012 - 10:30
    Room Aurigny
    Talk abstract: 

    Human metagenomics aims to characterize the associations between microbial species and genes and human phenotypes, in order to develop diagnostic and prognostic tools as well as approaches for modulating microbial populations, with the goal of optimizing each individual's health. Metagenomic studies have been made easier in recent years by the development of very-high-throughput sequencing and screening technologies. This seminar will present the four main areas of metagenomics: functional metagenomics, phylogenetic metagenomics, so-called "whole sequencing" metagenomics, and quantitative metagenomics. Particular attention will be paid to the last two, with a detailed illustration of the latest results obtained in the MicroObese and MetaHIT projects, which aim in particular to identify associations between microbial populations and obesity.