Combining machine learning and evolution for the annotation of metagenomics data

Ari Ugarte

Résumé

Metagenomics is used to study microbial communities by the analyze of DNA extracted directly from environmental samples. It allows to establish a catalog very extended of genes present in the microbial communities. This catalog must be compared against the genes already referenced in the databases in order to find similar sequences and thus determine their function. In the course of this thesis, we have developed MetaCLADE, a new methodology that improves the detection of protein domains already referenced for metagenomic and metatranscriptomic sequences. For the development of MetaCLADE, we modified an annotation system of protein domains that has been developed within the Laboratory of Computational and Quantitative Biology clade called (closer sequences for Annotations Directed by Evolution) [17]. In general, the methods for the annotation of protein domains characterize protein domains with probabilistic models. These probabilistic models, called sequence consensus models (SCMs) are built from the alignment of homolog sequences belonging to different phylogenetic clades and they represent the consensus at each position of the alignment. However, when the sequences that form the homolog set are very divergent, the signals of the SCMs become too weak to be identified and therefore the annotation fails. In order to solve this problem of annotation of very divergent domains, we used an approach based on the observation that many of the functional and structural constraints in a protein are not broadly conserved among all species, but they can be found locally in the clades. The approach is therefore to expand the catalog of probabilistic models by creating new models that focus on the specific characteristics of each clade. MetaCLADE, a tool designed with the objective of annotate with precision sequences coming from metagenomics and metatranscriptomics studies uses this library in order to find matches between the models and a database of metagenomic or metatranscriptomic sequences. Then, it uses a pre-computed step for the filtering of the sequences which determine the probability that a prediction is a true hit. This pre-calculated step is a learning process that takes into account the fragmentation of metagenomic sequences to classify them. We have shown that the approach multi source in combination with a strategy of meta-learning taking into account the fragmentation outperforms current methods.

La métagénomique sert à étudier les communautés microbiennes en analysant de l’ADN extrait directement d’échantillons pris dans la nature, elle permet également d’établir un catalogue très étendu des gènes présents dans les communautés microbiennes. Ce catalogue doit être comparé contre les gènes déjà référencés dans les bases des données afin de retrouver des séquences similaires et ainsi déterminer la fonction des séquences qui le composent. Au cours de cette thèse, nous avons développé MetaCLADE, une nouvelle méthodologie qui améliore la détection des domaines protéiques déjà référencés pour des séquences issues des données métagénomiques et métatranscriptomiques. Pour le développement de MetaCLADE, nous avons modifié un système d’annotations de domaines protéiques qui a été développé au sein du Laboratoire de Biologie Computationnelle et Quantitative appelé CLADE (CLoser sequences for Annotations Directed by Evolution) [17]. En général les méthodes pour l’annotation de domaines protéiques caractérisent les domaines connus avec des modèles probabilistes. Ces modèles probabilistes, appelés Sequence Consensus Models (SCMs) sont construits à partir d’un alignement des séquences homologues appartenant à différents clades phylogénétiques et ils représentent le consensus à chaque position de l’alignement. Cependant, quand les séquences qui forment l’ensemble des homologues sont très divergentes, les signaux des SCMs deviennent trop faibles pour être identifiés et donc l’annotation échoue. Afin de résoudre ce problème d’annotation de domaines très divergents, nous avons utilisé une approche fondée sur l’observation que beaucoup de contraintes fonctionnelles et structurelles d’une protéine ne sont pas globalement conservées parmi toutes les espèces, mais elles peuvent être conservées localement dans des clades. L’approche consiste donc à élargir le catalogue de modèles probabilistes en créant de nouveaux modèles qui mettent l’accent sur les caractéristiques propres à chaque clade. MetaCLADE, un outil conçu dans l’objectif d’annoter avec précision des séquences issues des expériences métagénomiques et métatranscriptomiques utilise cette libraire afin de trouver des correspondances entre les modèles et une base de données de séquences métagénomiques ou métatranscriptomiques. En suite, il se sert d’une étape pré-calculée pour le filtrage des séquences qui permet de déterminer la probabilité qu’une prédiction soit considérée vraie. Cette étape pré-calculée est un processus d’apprentissage qui prend en compte la fragmentation de séquences métagénomiques pour les classer.Nous avons montré que l’approche multi source en combinaison avec une stratégie de méta apprentissage prenant en compte la fragmentation atteint une très haute performance.

Combining machine learning and evolution for the annotation of metagenomics data

La combinaison de l'apprentissage statistique et de l'évolution pour l'annotation des données métagénomiques

Résumé

Mots clés

Domaines

Dates et versions

Identifiants

Citer

Exporter

Collections

Partager