Spectral methods for reconstructing latent orderings, and applications to de novo genome assembly.

Antoine Recanati (ENS)
Thursday, June 28, 2018 - 10:30
Room Aurigny
Talk abstract: 

The seriation problem seeks to recover a latent ordering from similarity information. We typically observe a matrix measuring pairwise similarity between a set of n elements and assume they have a serial structure, i.e. they can be ordered along a chain where the similarity between elements decreases with their distance within this chain. In the context of de novo prokaryotic genome assembly, within the Overlap-Layout-Consensus paradigm, we collect fragments of DNA (reads) randomly sampled across the genome, with sufficient coverage so that a given read overlaps with the neighboring reads. However, the position of the reads within the genome is unknown, so one has to solve a sort of jigsaw puzzle (the layout). This layout step can fit in the framework of Seriation, where the similarity measures the overlap (if any) between two reads, and we wish to reorder the reads such that two neighboring reads have a large overlap, and two reads far apart do not overlap.

In this talk, I will present a basic spectral method for Seriation, akin to the Spectral Clustering method widely used in Machine Learning, together with a simple extension that allows to deal with circular orderings and improves robustness to noise. I will then present results of this method applied to finding the layout of E. coli reads sequenced with Oxford Nanopore Technology MinION device.