Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem

Matthias Gallé 1
1 SYMBIOSE - Biological systems and models, bioinformatics and sequences
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Motivated by the goal of discovering hierarchical structures inside DNA sequences, we address the Smallest Grammar Problem, the problem of finding a smallest context-free grammar that generates exactly one sequence. This NP-hard problem has been widely studied for applications such as Data Compression, Structure Discovery and Algorithmic Information Theory. From the theoretical point of view, our contribution to this problem is a new formalisation of the Smallest Grammar Problem based on two complementary optimisation problems: the choice of the constituents of the final grammar and the choice of how to parse the sequence with these constituents. We give a polynomial-time solution for the latter problem, which we named the "Minimal Grammar Parsing" problem. This decomposition allows us to define a new complete and correct search space for the Smallest Grammar Problem. Based on this search space, we propose new algorithms able to return grammars 10% smaller than the state of the art on complete genomes. Regarding efficiency, we study different equivalence classes of repeats and introduce an efficient in-place scheme to update the suffix array data structure used to compute these words. We conclude this thesis by analysing the applications. For Structure Discovery, we consider the impact of the non-uniqueness of smallest grammars. We prove that the number of smallest grammars can be exponential in the size of the sequence and then analyse the stability of the discovered structures between minimal grammars for real-life examples. With respect to Data Compression, we extend our algorithms to use rigid patterns as words and achieve compression rates up to 25% better than those of the previous best DNA grammar-based coder.
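The decomposition described in the abstract can be illustrated with a small sketch. It assumes the constituents are already chosen and covers only the parsing step, using a plain dynamic programme over prefixes. This is a simplified illustration, not the algorithm of the thesis: the function names, the "<start>" rule key and the size measure (sum of right-hand-side lengths) are assumptions made for this example.

```python
def min_parse(target, words):
    """Return a shortest parse of `target` into symbols, where a symbol is
    either a single character or a word from `words` other than `target`."""
    n = len(target)
    usable = [w for w in words if w and w != target]
    best = [0] + [n + 1] * n      # best[i]: fewest symbols covering target[:i]
    back = [None] * (n + 1)       # back[i]: last symbol of that parse
    for i in range(1, n + 1):
        # Default choice: end the parse with the single character target[i-1].
        best[i], back[i] = best[i - 1] + 1, target[i - 1]
        for w in usable:
            j = i - len(w)
            if j >= 0 and target[j:i] == w and best[j] + 1 < best[i]:
                best[i], back[i] = best[j] + 1, w
    parse, i = [], n
    while i > 0:                  # walk back from the end to recover the parse
        parse.append(back[i])
        i -= len(back[i])
    return parse[::-1]


def build_grammar(sequence, constituents):
    """One rule per constituent plus a start rule (hypothetical key "<start>")."""
    rules = {"<start>": min_parse(sequence, constituents)}
    for w in constituents:
        rules[w] = min_parse(w, constituents)
    return rules


def grammar_size(rules):
    """Assumed size measure: total length of all right-hand sides."""
    return sum(len(rhs) for rhs in rules.values())


def expand(symbol, rules):
    """Expand a symbol back into the string it generates."""
    if symbol not in rules:
        return symbol             # terminal character
    return "".join(expand(s, rules) for s in rules[symbol])


rules = build_grammar("ACGTACGTAC", ["ACGT", "AC"])
print(rules["<start>"], grammar_size(rules))
```

With the constituents "ACGT" and "AC", the start rule uses three symbols, the rule for "ACGT" is parsed as "AC", "G", "T", and the total size under the assumed measure is 8, against 10 for the raw sequence. Choosing which constituents to pass in is the other half of the decomposition and is not addressed by this sketch.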

https://tel.archives-ouvertes.fr/tel-00595494
Contributor : Matthias Gallé
Submitted on : Tuesday, May 24, 2011 - 9:27:14 PM
Last modification on : Friday, November 16, 2018 - 1:23:49 AM
Long-term archiving on : Friday, November 9, 2012 - 12:06:49 PM

Identifiers

  • HAL Id : tel-00595494, version 1

Citation

Matthias Gallé. Searching for Compact Hierarchical Structures in DNA by means of the Smallest Grammar Problem. Modeling and Simulation. Université Rennes 1, 2011. English. ⟨tel-00595494⟩
