Multivariate analysis of high-throughput sequencing data

Abstract : The statistical analysis of Next-Generation Sequencing data raises many computational challenges regarding modeling and inference, especially because of the high dimensionality of genomic data. The research work in this manuscript concerns hybrid dimension reduction methods that rely on both compression (representation of the data in a lower-dimensional space) and variable selection. Two frameworks are developed: sparse Partial Least Squares (PLS) regression for supervised classification, and sparse matrix factorization for unsupervised exploration. In both situations, our main focus is the reconstruction and visualization of the data. First, we present a new sparse PLS approach, based on an adaptive sparsity-inducing penalty, that is suitable for logistic regression to predict the label of a discrete outcome. For instance, such a method can be used for prediction (fate of patients or specific type of unidentified single cells) based on gene expression profiles. The main issue in such a framework is to account for the response when discarding irrelevant variables. We highlight the direct link between the derivation of the algorithms and the reliability of the results. Then, motivated by questions regarding single-cell data analysis, we propose a flexible model-based approach for the factorization of count matrices that accounts for over-dispersion as well as zero-inflation (both characteristic of single-cell data), and for which we derive an estimation procedure based on variational inference. In this scheme, we consider probabilistic variable selection based on a spike-and-slab model suitable for count data. The interest of our procedure for data reconstruction, visualization and clustering is illustrated by simulation experiments and by preliminary results on single-cell data analysis. All proposed methods were implemented in two R packages, "plsgenomics" and "CMF", that rely on high-performance computing.
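To make the combination of compression and variable selection concrete, the following is a minimal sketch of a generic first sparse PLS direction obtained by soft-thresholding the covariance between each variable and the response. It is not the adaptive penalty or the logistic-regression extension proposed in the thesis, and it does not reproduce the "plsgenomics" interface. The function names, the `sparsity` tuning parameter and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything within the threshold becomes zero."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def sparse_pls_direction(X, y, sparsity):
    """Unit-norm weight vector for one sparse PLS component.

    X: list of rows (samples), columns assumed centered.
    y: centered response, one value per sample.
    sparsity: hypothetical tuning parameter in [0, 1); the threshold is
    sparsity * max |covariance|, so the strongest variable always survives
    (provided at least one covariance is non-zero).
    """
    n, p = len(X), len(X[0])
    # Covariance (up to a constant) between each variable and the response.
    cov = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    threshold = sparsity * max(abs(c) for c in cov)
    # Variable selection: weakly related variables get a weight of zero.
    w = [soft_threshold(c, threshold) for c in cov]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


def compress(X, w):
    """Compression: project each sample onto the sparse direction."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


# Toy data (illustrative only): 4 samples, 3 centered variables,
# and a centered two-class response coded as +1 / -1.
X = [
    [1.0, 0.2, -0.1],
    [-1.0, -0.1, 0.2],
    [2.0, 0.1, 0.1],
    [-2.0, -0.2, -0.2],
]
y = [1.0, -1.0, 1.0, -1.0]

w = sparse_pls_direction(X, y, sparsity=0.5)
selected = [j for j, weight in enumerate(w) if weight != 0]
scores = compress(X, w)
print("selected variables:", selected)
print("component scores:", scores)
```

In this sketch the response drives the selection: only variables whose covariance with `y` exceeds the threshold keep a non-zero weight, and the resulting scores give a one-dimensional representation of the samples.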

Cited literature : 235 references
Contributor : Abes Star
Submitted on : Monday, September 4, 2017 - 2:10:19 PM
Last modification on : Thursday, March 21, 2019 - 2:51:22 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01581175, version 1


Ghislain Durif. Multivariate analysis of high-throughput sequencing data. Statistics [math.ST]. Université de Lyon, 2016. English. ⟨NNT : 2016LYSE1334⟩. ⟨tel-01581175⟩


