Variable selection in model-based clustering for high-dimensional data

Abstract : This thesis deals with variable selection for clustering. This problem has become all the more challenging with the recent increase in high-dimensional data, where the number of variables can greatly exceed the number of observations (DNA analysis, functional data clustering...). We propose a variable selection procedure for clustering suited to high-dimensional contexts. We consider clustering based on finite Gaussian mixture models in order to recast both the variable selection and the choice of the number of clusters as a global model selection problem. We use the variable selection property of l1-regularization to build a data-driven model collection efficiently. Our procedure differs from classical procedures using l1-regularization in how the mixture parameters are estimated: in each model of the collection, rather than considering the Lasso estimator, we compute the maximum likelihood estimator. We then select one of these maximum likelihood estimators by a non-asymptotic penalized criterion. From a theoretical viewpoint, we establish a model selection theorem for maximum likelihood estimators in a density estimation framework with a random model collection. We apply it in our context to determine a convenient penalty shape for our criterion. From a practical viewpoint, we carry out simulations to validate our procedure, for instance in the functional data clustering framework. The basic idea of our procedure, which consists of variable selection by l1-regularization but estimation by maximum likelihood, comes from theoretical results we establish in the first part of this thesis: we provide l1-oracle inequalities for the Lasso in the regression framework, which are valid with no assumption at all, contrary to the usual l0-oracle inequalities in the literature, thus suggesting a gap between l1-regularization and l0-regularization.
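The three steps described in the abstract (l1-regularization to build the model collection, maximum likelihood estimation within each model, selection by a penalized criterion) can be sketched in formulas as follows. This is only an illustrative sketch: the abstract gives no formulas, so all notation here is assumed (centered observations y_1, ..., y_n in R^p, number of clusters K, regularization parameter lambda, set of relevant variables J). Placing the l1 penalty on the component means is one common choice and is an assumption too. The penalty pen(m) is left generic, since its shape is derived in the thesis and is not stated in the abstract.

```latex
% Assumed notation: K-component Gaussian mixture density on R^p
s_\theta(y) = \sum_{k=1}^{K} \pi_k \, \Phi(y \mid \mu_k, \Sigma_k),
\qquad \theta = (\pi_k, \mu_k, \Sigma_k)_{1 \le k \le K}.

% Step 1 (assumed form): Lasso-type estimator and selected variables,
% computed for each K and each lambda on a grid
\hat{\theta}(K, \lambda) \in \operatorname*{arg\,min}_{\theta}
  \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \ln s_\theta(y_i)
         + \lambda \sum_{k=1}^{K} \sum_{j=1}^{p} |\mu_{kj}| \Big\},
\qquad
J(K, \lambda) = \big\{ j : \exists k, \ \hat{\mu}_{kj}(K, \lambda) \neq 0 \big\}.

% Step 2: data-driven model collection and maximum likelihood
% estimator in each model m = (K, J), where S_m denotes the
% K-component mixtures whose relevant variables are those in J
\mathcal{M} = \big\{ (K, J(K, \lambda)) \big\},
\qquad
\hat{s}_m \in \operatorname*{arg\,max}_{s \in S_m}
  \frac{1}{n} \sum_{i=1}^{n} \ln s(y_i).

% Step 3: selection by a penalized criterion (pen left unspecified)
\hat{m} \in \operatorname*{arg\,min}_{m \in \mathcal{M}}
  \Big\{ -\frac{1}{n} \sum_{i=1}^{n} \ln \hat{s}_m(y_i)
         + \operatorname{pen}(m) \Big\}.
```

In this sketch the Lasso-type estimator of step 1 serves only to produce the variable sets J(K, lambda); the estimator finally retained is the maximum likelihood estimator of the selected model, which is the distinction the abstract draws with classical l1-based procedures.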

Cited literature : 108 references
Contributor : Abes Star
Submitted on : Friday, November 16, 2012 - 10:42:13 AM
Last modification on : Friday, May 17, 2019 - 10:42:55 AM
Long-term archiving on : Saturday, December 17, 2016 - 11:16:27 AM


  • HAL Id : tel-00752613, version 1



Caroline Meynet. Variable selection in model-based clustering for high-dimensional data. General Mathematics [math.GM]. Université Paris Sud - Paris XI, 2012. English. ⟨NNT : 2012PA112234⟩. ⟨tel-00752613⟩


