
Model selection for sparse high-dimensional learning

Abstract: The surge of digitization that characterizes the modern scientific era has given rise to new kinds of data sharing one common excess: the simultaneous acquisition of a very large number of measured quantities. Whether they come from DNA microarrays, mass spectrometers, or nuclear magnetic resonance, such data, usually called high-dimensional, are now ubiquitous in science and technology. Processing them calls for a major renewal of the traditional statistical toolset, which is ill-suited to settings involving a large number of variables. Indeed, when the number of variables exceeds the number of observations, most traditional statistical techniques break down. We first give a brief overview of the statistical issues that arise with high-dimensional data. We review several popular solutions and argue in favor of the approach adopted and advocated in this thesis: Bayesian model uncertainty. This framework is then the subject of a detailed review that emphasizes several recent developments. These surveys are followed by three original contributions to high-dimensional model selection. We present a new algorithm for high-dimensional sparse regression called SpinyReg, which compares favorably to state-of-the-art methods on both real and synthetic data sets. We also describe a new data set for high-dimensional regression: predicting the number of visitors to the Orsay museum in Paris from bike-sharing data. We then focus on model selection for high-dimensional principal component analysis (PCA). Using a new theoretical result, we derive the first closed-form expression of the marginal likelihood of a PCA model. This allows us to propose two algorithms for model selection in PCA: the first, called globally sparse probabilistic PCA (GSPPCA), performs scalable variable selection; the second, called normal-gamma probabilistic PCA (NGPPCA), estimates the intrinsic dimensionality of a high-dimensional data set. Both methods are competitive with other popular approaches. In particular, on unlabelled DNA microarray data, GSPPCA selects genes that are more biologically relevant than those selected by several popular approaches.
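The closed-form marginal likelihood underlying NGPPCA is specific to the thesis and is not available in standard libraries. As a rough, off-the-shelf stand-in for the same task (Bayesian estimation of the intrinsic dimensionality of a PCA model), the sketch below uses scikit-learn's `PCA(n_components="mle")`, which implements Minka's Laplace approximation to the PCA marginal likelihood; the simulated data set and its dimensions are illustrative assumptions, not material from the thesis.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p, d_true = 500, 20, 3  # illustrative sizes: 500 samples, 20 variables, 3 latent factors

# Simulate data from a probabilistic PCA model: X = Z W^T + isotropic noise
Z = rng.normal(size=(n, d_true))                 # latent factors
W = rng.normal(scale=5.0, size=(p, d_true))      # loadings (strong signal)
X = Z @ W.T + rng.normal(scale=0.1, size=(n, p))  # small observation noise

# Minka's MLE picks the number of components by approximating the
# marginal likelihood of each candidate dimensionality.
pca = PCA(n_components="mle").fit(X)
print(pca.n_components_)
```

With a strong signal-to-noise ratio as above, the marginal-likelihood criterion recovers the true latent dimensionality; the thesis's exact closed-form expression plays the same role without relying on a Laplace approximation.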

Cited literature: 344 references
Contributor: Pierre-Alexandre Mattei
Submitted on: Tuesday, December 5, 2017 - 11:51:58 AM
Last modification on: Tuesday, November 24, 2020 - 8:26:01 PM


Files produced by the author(s)


  • HAL Id: tel-01655924, version 1


Pierre-Alexandre Mattei. Model selection for sparse high-dimensional learning. Statistics [math.ST]. Université Paris 5, 2017. English. ⟨tel-01655924⟩


