Complexity in Genomic Patterns and Classification

Somdatta Sinha (Indian Institute of Science Education and Research Mohali, India)
Wednesday, May 27, 2015 - 14:00
Room Minquiers
Talk abstract: 

Genomes are sequences of four nucleotides: A, T, C, and G. Several processes, such as mutation, transposition, recombination, translocation, and excision, introduce variations in these sequences, which then become the substrates of selection and consequent evolution. Similarity in the linear composition of these letters in two sequences is commonly used as an indicator of the evolutionary closeness of two organisms. However, researchers are increasingly looking at groups of letters ("words"), or different patterns of nucleotide sequences ("genomic signatures"), and have found that the DNA of closely related organisms also has similar genomic signatures. This encourages us to look into the compositional properties of DNA sequences and their relevance to function and evolution. In this talk, I will discuss how these patterns can be used for alignment-free classification of very closely related DNA sequences using Chaos Game Representation (CGR). This points towards higher-order word structures carrying some meaning in the DNA language, and towards an interplay between complex word structures and biological information processing. Long-range correlations are also known to exist in genomes at different length scales, and genome sequences have been shown to be multifractals. I will also show that the multifractal properties of these DNA sequences can be used to classify very closely related organisms (subtypes and sub-subtypes of HIV-1 strains). The questions to be explored are the origin of the compositional complexity in DNA, and its functional and evolutionary implications.
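
For readers unfamiliar with Chaos Game Representation, the following is a minimal sketch of the general technique, not the speaker's implementation. It assumes one common corner assignment (A, C, G, T on the corners of the unit square) and a start at the centre; the function names `cgr_points`, `cgr_grid`, and `grid_distance` are hypothetical, and the Euclidean distance is only one simple choice for comparing two sequences without alignment.

```python
import math

# Assumed corner convention; the abstract does not specify one.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR point for each nucleotide: move halfway to its corner."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N (a simplification)
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points in a 2^k x 2^k grid; each cell corresponds to one k-letter word."""
    n = 2 ** k
    grid = [[0] * n for _ in range(n)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i + 1 < k:
            continue  # the first k-1 points still depend on the starting point
        col = min(int(x * n), n - 1)
        row = min(int(y * n), n - 1)
        grid[row][col] += 1
    return grid


def grid_distance(g1, g2):
    """Euclidean distance between two grids after normalising counts to frequencies."""
    t1 = sum(map(sum, g1)) or 1
    t2 = sum(map(sum, g2)) or 1
    return math.sqrt(sum((a / t1 - b / t2) ** 2
                         for r1, r2 in zip(g1, g2)
                         for a, b in zip(r1, r2)))
```

Because each cell of the 2^k x 2^k grid collects the points produced by exactly one word of length k, the grid acts as a word-frequency "signature", and `grid_distance(cgr_grid(s1, k), cgr_grid(s2, k))` gives a rough alignment-free dissimilarity between two sequences.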