
Style du génome exploré par analyse textuelle de l'ADN

Abstract : DNA sequences can be considered as texts written in a 4-letter alphabet. A technique inspired by textual data analysis characterizes these sequences by the frequencies of short oligonucleotides (or words). The full set of word frequencies is called the “genomic signature” (the term “signature” is justified because this set is species-specific). Since the genomic signature can be observed in DNA segments as short as 1 kb, it appears to result from a “writing style” that characterizes the organization of DNA throughout each genome. Moreover, proximity between species in terms of genomic signature often corresponds to proximity in terms of taxonomy. However, genomic signature analysis quickly runs into limitations due to the curse of dimensionality. Indeed, high-dimensional data (the genomic signature generally has 256 dimensions) show unusual properties. A well-known example is the concentration of Euclidean distances.
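As an illustration of the definition above, here is a minimal sketch of a signature computation. It assumes words of length 4 (which gives the 4^4 = 256 dimensions mentioned in the abstract), overlapping windows, and a single strand; the thesis may use different conventions, and the function name `genomic_signature` is hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def genomic_signature(sequence, k=4):
    """Return the frequencies of all 4**k words of length k, in lexicographic order.

    Assumption: overlapping words read on one strand only.
    """
    sequence = sequence.upper()
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping substring of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Normalize over words made only of A, C, G, T (ambiguous bases such as N are skipped).
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]

sig = genomic_signature("ACGTACGTTTGACCA")
print(len(sig))  # 256 dimensions for k=4
```

Two sequences can then be compared by any metric on these frequency vectors; the abstract notes that the choice of metric matters in such a high-dimensional space.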
From these observations, we set up procedures to evaluate metrics so as to bring out the biological information that can be extracted from genomic signatures. An associated non-linear method for representing neighborhoods escapes the curse of dimensionality and makes it possible to visualize the space occupied by the data. Analyzing the relations between signatures raises the question of how much each variable (each word) contributes to the distance between signatures. An original Z-score based on the variation of word frequencies along genomes makes it possible to quantify these contributions. Comparing “local signatures” makes it possible to extract atypical regions. In addition, these regions are segmented precisely by a method based on signal analysis.
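The idea of local signatures and per-word contributions can be sketched as follows. This is only an assumed, simplified stand-in: the abstract does not define the Z-score, so the code uses a plain standardization of each word frequency across windows, and the names `local_signatures` and `word_z_scores`, the window size, and the toy sequence are all hypothetical.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

def signature(seq, k):
    """Frequencies of all 4**k words of length k (overlapping, one strand)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words) or 1
    return [counts[w] / total for w in words]

def local_signatures(genome, window, step, k):
    """Signatures of successive windows along the genome."""
    return [
        signature(genome[start:start + window], k)
        for start in range(0, len(genome) - window + 1, step)
    ]

def word_z_scores(local_sigs):
    """z[w][j]: deviation of word j in window w, in units of that word's
    standard deviation across all windows (assumed definition, not the thesis's)."""
    n_words = len(local_sigs[0])
    mu = [mean([s[j] for s in local_sigs]) for j in range(n_words)]
    sd = [pstdev([s[j] for s in local_sigs]) for j in range(n_words)]
    return [
        [(s[j] - mu[j]) / sd[j] if sd[j] > 0 else 0.0 for j in range(n_words)]
        for s in local_sigs
    ]

# Toy sequence: an AT-repeat "host" with a short GC-repeat insert in the middle.
genome = "AT" * 50 + "GC" * 10 + "AT" * 50
sigs = local_signatures(genome, window=20, step=20, k=2)
z = word_z_scores(sigs)

# Rank windows by total absolute deviation; the largest one is the most atypical.
most_atypical = max(range(len(z)), key=lambda w: sum(abs(v) for v in z[w]))
print(most_atypical)  # index of the window covering the GC-repeat insert
```

On real genomes the windows would be on the order of the 1 kb segments mentioned in the abstract, and the boundaries of an atypical region would still need the separate segmentation step that the abstract describes.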
From this set of methods, we derive a variety of biological results. In particular, we highlight an organization of the genomic signature space that is consistent with species taxonomy. Moreover, we note the presence of a “DNA syntax”: there are “syntactic words” and “semantic words”, and the signature rests mainly on syntactic words. Lastly, analyzing signatures along the genome allows the detection and precise segmentation of RNA and of probable horizontal transfers. The convergence of the style of horizontally transferred regions towards the host signature can also be observed.
Signature analysis thus yielded results of various kinds. Its ease of use and speed make genomic signature analysis a powerful tool for extracting biological information from genomes.
Contributor : Sylvain Lespinats
Submitted on : Monday, June 4, 2007 - 5:13:44 PM
Last modification on : Wednesday, December 9, 2020 - 3:11:07 PM
Long-term archiving on : Friday, September 21, 2012 - 4:11:09 PM


  • HAL Id : tel-00151611, version 1


Sylvain Lespinats. Style du génome exploré par analyse textuelle de l'ADN. Life Sciences [q-bio]. Université Pierre et Marie Curie - Paris VI, 2006. French. ⟨tel-00151611⟩


