
Understanding and improving statistical models of protein sequences

Abstract : In recent decades, progress in experimental techniques has given rise to a vast increase in the number of known DNA and protein sequences. This has prompted the development of various statistical methods to make sense of this massive amount of data. Among them are pairwise co-evolutionary methods, which use ideas from statistical physics to construct a global model of protein sequence variability. These methods have proven very effective at extracting relevant information from sequences, such as structural contacts or the effects of mutations. While co-evolutionary models are currently used as predictive tools, their success calls for a better understanding of how they function. In this thesis, we propose developments of existing methods while also asking how and why they work. We first focus on the ability of the so-called Direct Coupling Analysis (DCA) to reproduce statistical patterns found in the sequences of a protein family. We then discuss the possibility of including other types of information, such as mutational effects, in this method, followed by potential corrections for the phylogenetic biases present in available data. Finally, we present considerations on the limitations of current co-evolutionary models, along with suggestions on how to overcome them.

Cited literature : 157 references
Contributor : Abes Star
Submitted on : Friday, June 12, 2020 - 11:25:26 AM
Last modification on : Monday, August 31, 2020 - 12:30:19 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02866062, version 1


Pierre Barrat-Charlaix. Understanding and improving statistical models of protein sequences. Bioinformatics [q-bio.QM]. Sorbonne Université, 2018. English. ⟨NNT : 2018SORUS378⟩. ⟨tel-02866062⟩


