Understanding and improving statistical models of protein sequences

Pierre Barrat-Charlaix

Thesis. Year: 2018

French title: Comprendre et améliorer les modèles statistiques de séquences de protéines

Pierre Barrat-Charlaix

Role: Author
PersonId: 779791
ORCID: 0000-0002-3816-3724

Laboratory of Computational and Quantitative Biology (Biologie Computationnelle et Quantitative)

Abstract

In recent decades, progress in experimental techniques has led to a vast increase in the number of known DNA and protein sequences. This has prompted the development of various statistical methods to make sense of this massive amount of data. Among them are pairwise co-evolutionary methods, which use ideas from statistical physics to construct a global model of protein sequence variability. These methods have proven very effective at extracting relevant information from sequences alone, such as structural contacts or the effects of mutations. While co-evolutionary models are, for now, used as predictive tools, their success calls for a better understanding of how they function. In this thesis, we propose developments of existing methods while also asking how and why they work. We first focus on the ability of the so-called Direct Coupling Analysis (DCA) to reproduce statistical patterns found in the sequences of a protein family. We then discuss the possibility of including other types of information, such as mutational effects, in this method, followed by potential corrections for the phylogenetic biases present in available data. Finally, we present considerations on the limitations of current co-evolutionary models, along with suggestions on how to overcome them.


Keywords

Co-evolution; Statistical models; Statistical inference; Proteins; Maximum-entropy; Statistical physics; Phylogeny


Domains

Bioinformatics [q-bio.QM]; Biochemistry, Molecular Biology

Main file

these_BARRAT-CHARLAIX_Pierre_2018.pdf (10.77 MB)

Origin: Version approved by the jury (STAR)


https://theses.hal.science/tel-02866062

Submitted on: Friday, June 12, 2020, 11:25:26

Last modified on: Monday, April 15, 2024, 15:16:20

Dates and versions

tel-02866062, version 1 (2020-06-12)

Identifiers

HAL Id: tel-02866062, version 1

Cite

Pierre Barrat-Charlaix. Understanding and improving statistical models of protein sequences. Bioinformatics [q-bio.QM]. Sorbonne Université, 2018. English. ⟨NNT: 2018SORUS378⟩. ⟨tel-02866062⟩


Collections

INSERM, CNRS, STAR, LCQB, IBPS, SORBONNE-UNIVERSITE, THESES-SU, SU-SCIENCES

