
Variable importance measures in semiparametric and high-dimensional models with or without error-in-variables

Abstract : Over the last few decades, advances in technology have considerably improved our capacity to collect and store large amounts of information and, as a consequence, enhanced our data mining potential. The repercussions on multiple scientific fields have been stark. In statistical analysis, for example, many results derived under the then-common low-dimensional framework, where the number of covariates is smaller than the size of the dataset, had to be extended. The literature now abounds with significant contributions in high-dimensional settings. Following this path, the current thesis addresses the concept of variable importance, that is, a methodology used to assess the significance of a variable. It is a focal point in today’s era of big data. For example, it is often used with prediction models in high-dimensional settings to select the main predictors. Our contributions can be divided into three parts.

In the first part of the thesis, we rely on semiparametric models for our analysis. We introduce a multivariate variable importance measure, defined as a sound statistical parameter, which is complemented by user-defined marginal structural models. It allows one to quantify the significance of an exposure on a response while taking into account all other covariates. The parameter is studied through the Targeted Minimum Loss Estimation (TMLE) methodology. We perform its full theoretical analysis. We establish consistency and asymptotic results, which in turn provide p-values for hypothesis testing of the parameter of interest. A numerical analysis illustrates the theoretical results. It is carried out by extending the implementation of the TMLE.NPVI package so that it can handle a multivariate parameter.

In the second part, we introduce a variable importance measure defined through a nonparametric regression model under a high-dimensional framework. It is partially derived from the parameter described in the first part of the thesis, without the requirement that the user provide a marginal structural model. The regression model comes with the caveat that its data structure is, in some cases, subject to measurement errors. Using a high-dimensional projection on an orthonormal basis such as Fourier series, smoothing splines and the Lasso methodology, we establish consistency and the convergence rates of our estimators. We further discuss how these rates are affected when the design of the dataset is contaminated by measurement errors. A numerical study, based on simulated and on financial datasets, is provided.

In the third and final part of this thesis, we consider a variable importance measure defined through a linear regression model subject to errors-in-variables. This regression model is derived in the previous chapter. The parameter of interest is estimated through a convex optimization problem, obtained by projecting the empirical covariance estimator on the set of symmetric non-negative matrices and using the Slope methodology. We perform its complete theoretical and numerical analysis. We establish sufficient conditions, rather restrictive on the noise variables, under which optimal convergence rates are attained for the parameter of interest, and discuss the impact of measurement errors on these rates.
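As a rough illustration of the projection step mentioned in the third part, the following Python sketch corrects an empirical covariance matrix for additive measurement noise and then projects it onto the set of symmetric positive semidefinite matrices. It is only a sketch under stated assumptions, not the thesis's estimator: the additive noise model, the known noise covariance `noise_cov`, the use of the Frobenius norm for the projection, and the function names `corrected_covariance` and `project_psd` are all assumptions made here for illustration.

```python
import numpy as np


def corrected_covariance(z, noise_cov):
    """Empirical covariance of noisy covariates, corrected for noise.

    Assumes (hypothetically) the additive model Z = X + W with centered
    columns and a known noise covariance. The result may be indefinite.
    """
    n = z.shape[0]
    return z.T @ z / n - noise_cov


def project_psd(matrix):
    """Frobenius-norm projection onto symmetric positive semidefinite
    matrices: symmetrize, then clip negative eigenvalues to zero."""
    sym = (matrix + matrix.T) / 2.0
    eigvals, eigvecs = np.linalg.eigh(sym)
    clipped = np.clip(eigvals, 0.0, None)
    # Rebuild V diag(clipped) V^T by scaling the eigenvector columns.
    return (eigvecs * clipped) @ eigvecs.T


# Simulated high-dimensional design (p > n) with measurement errors.
rng = np.random.default_rng(0)
n, p = 20, 50
x = rng.normal(size=(n, p))
noise_cov = 0.5 * np.eye(p)
z = x + rng.normal(scale=np.sqrt(0.5), size=(n, p))

raw = corrected_covariance(z, noise_cov)
projected = project_psd(raw)
```

Because `p > n` in this simulation, the corrected matrix `raw` has negative eigenvalues, which would make a quadratic loss built on it non-convex; the projected matrix restores convexity of such a loss. Other norms could be used for the projection, and the choice here is purely illustrative.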
Contributor : Abes Star
Submitted on : Tuesday, August 24, 2021 - 2:51:11 PM
Last modification on : Wednesday, August 25, 2021 - 3:19:40 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03325213, version 1



Cabral Amilcar Chanang Tondji. Variable importance measures in semiparametric and high-dimensional models with or without error-in-variables. Statistics [math.ST]. Université Paris-Est, 2020. English. ⟨NNT : 2020PESC2042⟩. ⟨tel-03325213⟩


