Variable selection: genetic structure of a population and transmission of Plasmodium through the mosquito.

Abstract: This thesis is concerned with variable selection in two practical problems. The first is the identification of genetically homogeneous populations without prior information on the target population. The structure of interest may be contained in only a subset of the available genetic markers. We propose a model selection procedure that simultaneously selects the number of populations and the relevant subset of variables. The competing models are compared using penalized maximum likelihood criteria. Under weak assumptions on the penalty function, we prove the consistency of the selection procedure. We also propose a new penalty function with an associated non-asymptotic oracle inequality. In practice, this result suggests a penalty function defined up to a multiplicative parameter, which is calibrated using the slope heuristic. Using simulated data, we find that calibrating the penalty term improves the performance of the selection procedure compared with classical asymptotic criteria such as AIC and BIC. In addition, we provide a standalone C++ package implementing the proposed selection procedure. The second problem is motivated by malaria control strategies that aim to reduce the intensity of disease transmission. The available data are described by variables of different types, and the number of variables is of the same order as the sample size. To address this variable selection problem, we consider a procedure based on variable importance measures from random forests. The selected variables are then assessed in a zero-inflated negative binomial model.
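
As an illustration of the penalized maximum likelihood criteria mentioned in the abstract, a generic form can be sketched as follows. The notation ($K$, $S$, $D_{(K,S)}$, $\hat{\ell}_n$, $\lambda$) is assumed here for illustration and does not come from the abstract. The penalty that is linear in the model dimension is likewise an assumption, not necessarily the exact penalty derived in the thesis.

```latex
% Illustrative notation (assumed, not taken from the abstract):
%   K                 = number of populations
%   S                 = subset of selected genetic markers
%   \hat{\ell}_n(K,S) = maximized log-likelihood of model (K,S) on n individuals
%   D_{(K,S)}         = number of free parameters of model (K,S)
% Requires amsmath for \operatorname*.
\[
  (\hat{K}, \hat{S}) \in \operatorname*{arg\,min}_{(K,S)}
  \Bigl\{ -\hat{\ell}_n(K,S) + \operatorname{pen}_n(K,S) \Bigr\}
\]
% Classical asymptotic penalties, written on the same scale:
\[
  \operatorname{pen}_{\mathrm{AIC}}(K,S) = D_{(K,S)}, \qquad
  \operatorname{pen}_{\mathrm{BIC}}(K,S) = \tfrac{1}{2}\, D_{(K,S)} \log n
\]
% A penalty known only up to a multiplicative constant (assumed linear shape):
\[
  \operatorname{pen}_{\lambda}(K,S) = \lambda\, D_{(K,S)}, \qquad \lambda > 0
\]
```

In this sketch, $\lambda$ is the multiplicative parameter that is calibrated from the data. The slope heuristic typically does this by estimating a minimal penalty constant and doubling it.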
