Filtrage de séquences d'ADN pour la recherche de longues répétitions multiples (Filtering DNA sequences to search for long multiple repetitions)

Abstract : For several years now, molecular genomics has had to deal with new situations. First, the amount of available data is increasing exponentially. Second, research in this domain raises new questions that lead to algorithmically difficult problems.

Among such problems, some are related to the study of genomic rearrangements, including duplicated and transposable elements. Such a task requires the ability to detect, accurately and efficiently, long multiple approximate repetitions in genomes. A multiple repetition is a repetition that has at least two copies in a DNA sequence, or that has copies in at least two distinct DNA sequences. Furthermore, the repetitions involved are called approximate because their occurrences differ from one another by errors such as insertions, deletions and substitutions.

The problem of searching for long multiple approximate repetitions may be solved by multiple local alignment algorithms. Such algorithms have a complexity that is exponential in the size of the input, so they cannot be applied to data as large as genomes. This is why new techniques have to be created to address these new problems.

In this PhD thesis, a filtration approach for comparing DNA sequences is proposed. The goal of this approach is to remove, accurately and efficiently, large portions of the texts representing DNA that cannot contain an occurrence of a repetition. The filtered data, which in general correspond to the relevant portions, may then be used as input to a multiple local alignment algorithm.

The proposed filters apply a necessary condition to the sequences. Only the portions of sequences that satisfy this condition are kept. The work presented deals with the creation of filtration conditions. Such conditions have to be both efficient and, from an algorithmic point of view, easy to apply. Using the proposed filtration conditions, two filters, Nimbus and Ed'Nimbus, were created. These filters are called exact because the condition applied guarantees that no relevant part of the data can be filtered out. Their efficiency, both in terms of the accuracy of the filtration and of running time, leads to very good practical results. For instance, the time spent by repetition extraction algorithms or multiple alignment algorithms may be reduced by several orders of magnitude by using one of the proposed filters.
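
To illustrate what an exact filter based on a necessary condition can look like, here is a minimal Python sketch. It does not reproduce the condition used by Nimbus or Ed'Nimbus, which this abstract does not detail. Instead it relies on the classical q-gram lemma, restricted to substitutions only: two windows of length L that differ by at most k substitutions share at least (L - q + 1) - k*q q-grams at matching offsets, so each window must contain at least that many q-grams occurring at least twice in the text. The function name `filter_positions` and the parameters `L`, `k` and `q` are assumptions made for this illustration.

```python
from collections import Counter


def filter_positions(text, L, k, q):
    """Hypothetical exact filter (substitutions only, q-gram lemma).

    Returns a list of booleans: keep[i] is True when position i lies in
    at least one window of length L that satisfies the necessary
    condition, i.e. that contains at least (L - q + 1) - k*q q-grams
    occurring two or more times in the text.
    """
    n = len(text)
    threshold = (L - q + 1) - k * q
    if threshold <= 0:
        raise ValueError("condition is vacuous: lower q or k, or raise L")
    if n < L:
        return [False] * n

    # Count every q-gram of the text once.
    counts = Counter(text[i:i + q] for i in range(n - q + 1))

    # prefix[i] = number of q-gram start positions < i whose q-gram is repeated.
    prefix = [0]
    for i in range(n - q + 1):
        repeated = 1 if counts[text[i:i + q]] >= 2 else 0
        prefix.append(prefix[-1] + repeated)

    keep = [False] * n
    qgrams_per_window = L - q + 1
    for start in range(n - L + 1):
        hits = prefix[start + qgrams_per_window] - prefix[start]
        if hits >= threshold:
            # The window may hold an occurrence of a repetition: keep it.
            for j in range(start, start + L):
                keep[j] = True
    return keep


if __name__ == "__main__":
    # Two copies of a length-10 word that differ by one substitution.
    sequence = "ACGTTGCAAG" + "ACGTTCCAAG"
    kept = filter_positions(sequence, L=10, k=1, q=3)
    print("".join(c if flag else "." for c, flag in zip(sequence, kept)))
```

Because the condition is only necessary, windows that pass may still be false positives. What the sketch guarantees is the exactness property described above: under the stated substitution-only model, no window belonging to a real repetition is discarded.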

It is worth noting that, although the work presented in this PhD thesis was motivated by biology, it is generic and can thus be used to filter any other kind of text in order to detect long multiple repeated portions.

Document type : Theses

https://tel.archives-ouvertes.fr/tel-00132300
Contributor : Pierre Peterlongo
Submitted on : Wednesday, February 21, 2007 - 9:40:37 AM
Last modification on : Tuesday, January 29, 2019 - 10:02:08 PM
Long-term archiving on : Friday, September 21, 2012 - 11:35:56 AM

Identifiers

  • HAL Id : tel-00132300, version 1

Citation

Pierre Peterlongo. Filtrage de séquences d'ADN pour la recherche de longues répétitions multiples. Interface homme-machine [cs.HC]. Université de Marne la Vallée, 2006. Français. ⟨tel-00132300⟩
