
Genetic risk score based on statistical learning

Abstract : Genotyping is becoming cheaper, making genotype data available for millions of individuals. Moreover, imputation makes it possible to obtain genotype information at millions of loci, capturing most of the genetic variation in the human genome. Given such large data and the fact that many traits and diseases are heritable (e.g. 80% of the variation of height in the population can be explained by genetics), it is envisioned that predictive models based on genetic information will be part of personalized medicine. In my thesis work, I focused on improving the predictive ability of polygenic models. Because prediction modeling is part of a larger statistical analysis of datasets, I developed tools to allow flexible exploratory analyses of large datasets, in the form of two R/C++ packages described in the first part of my thesis. Then, I developed an efficient implementation of penalized regression to build polygenic models based on hundreds of thousands of genotyped individuals. Finally, I improved the “clumping and thresholding” method, which is the most widely used polygenic method and is based on summary statistics, which are more widely available than individual-level data. Overall, I applied many concepts of statistical learning to genetic data. I used extreme gradient boosting for imputing genotyped variants, feature engineering to capture recessive and dominant effects in penalized regression, and parameter tuning and stacked regressions to improve polygenic prediction. Statistical learning is not widely used in human genetics, and my thesis is an attempt to change that.

Cited literature [234 references]
Contributor : Abes Star
Submitted on : Wednesday, February 12, 2020 - 3:12:32 PM
Last modification on : Wednesday, October 7, 2020 - 1:20:04 PM
Long-term archiving on : Wednesday, May 13, 2020 - 4:32:20 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02476202, version 1



Florian Privé. Genetic risk score based on statistical learning. Bioinformatics [q-bio.QM]. Université Grenoble Alpes, 2019. English. ⟨NNT : 2019GREAS024⟩. ⟨tel-02476202⟩


