BWT-based indexing structure for metagenomic classification

Karel Brinda (Université de Marne-la-Vallée)
Thursday, July 7, 2016 - 10:30
Room Aurigny
Talk abstract: 
Metagenomics is a powerful approach to studying the genetic content of environmental samples, and its adoption has been strongly driven by NGS technologies. One of its main tasks is the assignment of the reads of a metagenome to taxonomic units, followed by abundance estimation. Most recently developed programs for this task (such as LMAT, KRAKEN, KALLISTO) perform the assignment based on k-mers shared between reads and references. In such an approach, two major algorithmic subproblems can be distinguished: designing a k-mer index for a huge database of reference genomes and a given taxonomic tree, and designing an algorithm that assigns reads to taxonomic units using information on shared k-mers. In this talk, we consider the problem of index design and present a novel data structure that provides the full list of genomes containing a queried k-mer. The structure is based on a BWT index applied to sequences encoding the k-mers specific to each node of the taxonomic tree. We analyse the usefulness of this index and evaluate it in terms of speed and memory requirements.
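
To give a rough feel for the kind of query the abstract describes, here is a minimal Python sketch. It is not the structure presented in the talk. It assumes one toy FM-index (BWT plus rank tables) per tree node, built over that node's k-mers joined with a "|" separator. The tree, the k-mers, and the names NODE_KMERS, GENOMES_BELOW, build_fm_index, contains and genomes_containing are all hypothetical and chosen only for illustration.

```python
def build_fm_index(text):
    """Build a toy FM-index for text using a naive suffix sort."""
    text += "$"  # unique sentinel, smaller than A, C, G, T and '|'
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to '$' for i == 0)
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # first[c]: number of characters in text strictly smaller than c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return first, occ, len(bwt)


def contains(index, pattern):
    """Backward search: True if pattern occurs in the indexed text."""
    first, occ, n = index
    lo, hi = 0, n  # half-open interval of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return False
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return False
    return True


# Hypothetical toy data: k-mers assigned to each node of a small tree (k = 3),
# and the genomes (leaves) found below each node.
NODE_KMERS = {
    "root": ["ACG"],
    "cladeA": ["CGT", "TGC"],
    "genome1": ["GTA"],
    "genome2": ["GTC"],
    "genome3": ["TTT"],
}
GENOMES_BELOW = {
    "root": ["genome1", "genome2", "genome3"],
    "cladeA": ["genome1", "genome2"],
    "genome1": ["genome1"],
    "genome2": ["genome2"],
    "genome3": ["genome3"],
}

# One index per node; the separator keeps queries from matching across k-mers.
NODE_INDEX = {
    node: build_fm_index("|".join(kmers)) for node, kmers in NODE_KMERS.items()
}


def genomes_containing(kmer):
    """Return the sorted list of genomes below every node that holds kmer."""
    hits = set()
    for node, index in NODE_INDEX.items():
        if contains(index, kmer):
            hits.update(GENOMES_BELOW[node])
    return sorted(hits)
```

With this toy data, genomes_containing("CGT") returns ["genome1", "genome2"], because the k-mer is stored at the cladeA node and both genomes lie below it. A realistic index would replace the naive suffix sort and the full rank tables with compressed equivalents, and would likely avoid scanning every node for each query.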