# Chaînes de Markov régulées et approximation de Poisson pour l'analyse de séquences biologiques

Abstract : The statistical analysis of biological sequences, such as nucleotide sequences (DNA and RNA) or amino-acid sequences (proteins), requires different models depending on the study. Because successive nucleotides in a DNA sequence are dependent, Markov models are widely used for this purpose. The drawback of these models is that they assume biological sequences to be homogeneous, which they are not. A well-known example is the GC percentage: along a sequence, GC-rich and GC-poor regions alternate. To take this heterogeneity into account, other models are used: hidden Markov models (HMM), in which the sequence is divided into homogeneous regions. HMMs have many applications, such as the search for coding regions. However, not all biological features can be captured by these models, which is why we develop new models: drifting Markov models (DMM). Instead of fitting one transition matrix to the whole sequence (classical Markov model) or different transition matrices to different homogeneous parts of the sequence (HMM), we allow the transition matrix to vary (to drift) from the beginning to the end of the sequence. At each position $t$, we obtain a different transition matrix $\Pi_{t/n}$ (where $n$ is the sequence length). Our models are thus constrained heterogeneous Markov models. We give two ways to constrain the models: polynomial DMM and polynomial spline DMM. For instance, for a degree 1 DMM (linear drift), we fix a transition matrix $\Pi_0$ at the beginning of the sequence and a transition matrix $\Pi_1$ at the end of the sequence, and we let the transition matrix vary linearly from $\Pi_0$ to $\Pi_1$:
$$\Pi_{t/n} = \left(1 - \frac{t}{n}\right)\Pi_0 + \frac{t}{n}\,\Pi_1.$$
Such a model could correspond to a smooth evolution between two hidden states of an HMM, whose transitions may appear too abrupt. DMM can be seen as a competitor to HMM, but above all as a complementary tool: the hidden models of an HMM, usually fixed Markov chains, can be replaced by DMM. Throughout this work, we consider polynomial drift and drift by polynomial splines (the latter being more flexible than polynomials). We estimate our models in different ways, evaluate their quality and use them in biological applications such as the search for rare words. We develop the software DRIMM (soon available at http://stat.genopole.cnrs.fr/sg/software/drimm/), dedicated to the estimation of DMM. This program provides all the functionality of DMM, such as computing the transition matrix at each position and computing stationary laws. Its use for the search for rare words is offered through auxiliary programs (available on request).
This work opens several perspectives. Instead of letting the transition matrix vary only with the position along the sequence, we could take into account covariates such as the degree of hydrophobicity, the GC percentage, or an indicator of protein structure (α-helix, β-sheet, ...). The main perspective, however, remains the possibility of combining HMM and DMM, with DMM in the role of hidden states.
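As an illustration of the degree 1 (linear drift) model described in the abstract, the following minimal Python sketch interpolates between two transition matrices and simulates a sequence from them. It is not the DRIMM implementation: the function names, the two example matrices, the uniform choice of the first letter, and the convention that the matrix at t/n governs the transition into position t are all assumptions made for illustration.

```python
import random

ALPHABET = "acgt"

# Hypothetical order-1 transition matrices (rows and columns ordered a, c, g, t).
# PI0 favors a/t at the start of the sequence, PI1 favors c/g at the end.
PI0 = [[0.4, 0.1, 0.1, 0.4] for _ in range(4)]
PI1 = [[0.1, 0.4, 0.4, 0.1] for _ in range(4)]


def drift_matrix(pi0, pi1, t, n):
    """Return the linear-drift matrix (1 - t/n) * pi0 + (t/n) * pi1."""
    w = t / n
    return [
        [(1 - w) * a + w * b for a, b in zip(row0, row1)]
        for row0, row1 in zip(pi0, pi1)
    ]


def simulate(pi0, pi1, n, seed=0):
    """Simulate a sequence of length n under the linear-drift model.

    Assumptions: the first letter is drawn uniformly, and the transition
    into position t uses the matrix evaluated at t/n.
    """
    rng = random.Random(seed)
    state = rng.randrange(len(ALPHABET))
    letters = [ALPHABET[state]]
    for t in range(1, n):
        row = drift_matrix(pi0, pi1, t, n)[state]
        state = rng.choices(range(len(ALPHABET)), weights=row)[0]
        letters.append(ALPHABET[state])
    return "".join(letters)


if __name__ == "__main__":
    seq = simulate(PI0, PI1, n=60)
    print(seq)  # a/t should dominate near the start, c/g near the end
```

Because each interpolated matrix is a convex combination of two stochastic matrices, its rows still sum to one, so it remains a valid transition matrix at every position.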
Document type : Theses

https://tel.archives-ouvertes.fr/tel-00322434
Contributor : Nicolas Vergne
Submitted on : Wednesday, September 17, 2008 - 4:42:52 PM
Last modification on : Tuesday, March 17, 2020 - 3:09:56 AM
Long-term archiving on : Friday, June 4, 2010 - 11:30:26 AM

### Identifiers

• HAL Id : tel-00322434, version 1

### Citation

Nicolas Vergne. Chaînes de Markov régulées et approximation de Poisson pour l'analyse de séquences biologiques. Mathématiques [math]. Université d'Evry-Val d'Essonne, 2008. Français. ⟨tel-00322434⟩
