Contributions de l'apprentissage statistique aux méthodes GLMM et LASSO: Application à la modélisation statistique de la morbidité liée au paludisme à Tori-Bossito (Bénin)

Abstract : This thesis aims to identify the environmental factors that may explain the variability of anopheline density at the village and house scales, and to determine the risk of exposure to malaria in the study area. We treat these as variable selection and prediction problems in an epidemiological context. The main objective is therefore to select an optimal subset of variables for predicting malaria exposure risk in the study area and in other areas where entomological data are not available. In the first part of the thesis, we propose a method based on a GLMM (generalized linear mixed model) algorithm combined with a backward procedure for variable selection. Random effects are included at each level of the data hierarchy to account for the correlation that the hierarchical structure of the data may induce. This method provides an optimal subset of variables for predicting malaria risk. However, the algorithm does not converge when some explanatory variables are too strongly correlated or when the data have a particular structure. To overcome this, we propose in the second part an automatic machine learning method. Interactions between variables are generated automatically, and variable selection is performed with the Lasso and stratified two-level cross-validation. The selected variables are debiased, and the prediction is produced by a simple GLM (generalized linear model). The results of this method prove qualitatively better than those obtained in the first part in terms of selection, prediction, and CPU time. Finally, the best subset of predictors contains: Season; the interaction between Mean rainfall and Openings; the interaction between Rainy days before mission and Number of inhabitants; and the interaction between Rainy days during the mission and Vegetation.
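The second-part pipeline (automatic interactions, Lasso selection, then a debiased refit) can be illustrated with a minimal sketch. This is not the thesis code. It assumes a Gaussian linear model in place of the GLM, a hand-picked penalty `lam` in place of the stratified two-level cross-validation, and synthetic data. The function names `add_interactions`, `lasso_cd`, `select_and_debias`, and `demo` are hypothetical.

```python
import random
from itertools import combinations


def add_interactions(rows):
    """Append all pairwise products x_i * x_j (i < j) to each row."""
    return [list(r) + [r[i] * r[j] for i, j in combinations(range(len(r)), 2)]
            for r in rows]


def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n) * ||y - b0 - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    resid = [yi - b0 for yi in y]  # current residuals y - b0 - X b
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        # Unpenalised intercept update.
        shift = sum(resid) / n
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (b_j removed).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - b[j]
            if delta != 0.0:
                resid = [resid[i] - X[i][j] * delta for i in range(n)]
                b[j] = new
    return b0, b


def select_and_debias(X, y, lam):
    """Lasso picks the support; an unpenalised refit removes the shrinkage bias."""
    _, b = lasso_cd(X, y, lam)
    selected = [j for j, bj in enumerate(b) if bj != 0.0]
    X_sel = [[row[j] for j in selected] for row in X]
    # lam = 0 turns the same routine into a least-squares fit on the support.
    b0, b_refit = lasso_cd(X_sel, y, lam=0.0, n_iter=500)
    return selected, b0, b_refit


def demo(seed=0, n=200):
    """Synthetic example (assumption): y depends on x0 and on the product x1 * x2."""
    rng = random.Random(seed)
    base = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [2.0 + 3.0 * r[0] + 2.0 * r[1] * r[2] + rng.gauss(0, 0.1) for r in base]
    X = add_interactions(base)  # columns: x0, x1, x2, x0*x1, x0*x2, x1*x2
    return select_and_debias(X, y, lam=0.5)


if __name__ == "__main__":
    selected, intercept, coefs = demo()
    print("selected columns:", selected)
    print("refitted intercept and coefficients:", intercept, coefs)
```

The refit step matters because the Lasso shrinks the coefficients it keeps. Refitting without a penalty on the selected columns only is one common way to debias them before prediction. With count data, the refit would use a suitable GLM rather than least squares.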

https://hal.archives-ouvertes.fr/tel-01736933
Contributor : Bienvenue Kouwaye
Submitted on : Sunday, March 18, 2018 - 11:48:37 PM
Last modification on : Wednesday, January 23, 2019 - 9:58:15 AM
Long-term archiving on : Tuesday, September 11, 2018 - 8:53:09 AM

File

KOUWAYE_these.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-01736933, version 1

Citation

Bienvenue Kouwaye. Contributions de l'apprentissage statistique aux méthodes GLMM et LASSO: Application à la modélisation statistique de la morbidité liée au paludisme à Tori-Bossito (Bénin). Statistics [math.ST]. Université d'Abomey-Calavi (Bénin), 2018. French. ⟨tel-01736933⟩
