Theses

Machine Learning and Big Data for outlier detection, and applications

Abstract : The problems of outlier detection and robust regression in a high-dimensional setting are fundamental in statistics and have numerous applications. Following a recent set of works providing methods for simultaneous robust regression and outlier detection, we consider in the first part a linear regression model with individual intercepts, in a high-dimensional setting. We introduce a new procedure for simultaneous estimation of the linear regression coefficients and intercepts, using two dedicated sorted-l1 convex penalizations, also called SLOPE. We develop a complete theory for this problem: first, we provide sharp upper bounds on the statistical estimation error of both the vector of individual intercepts and the regression coefficients. Second, we give an asymptotic control on the False Discovery Rate (FDR) and statistical power for support selection of the individual intercepts. Numerical illustrations, with a comparison to recent alternative approaches, are provided on both simulated and several real-world datasets. The second part is motivated by a genetic problem. Among some particular DNA sequences called multi-satellites, which are indicators of the development of colorectal cancer tumors, we want to find the sequences that have a much higher (resp. much lower) rate of mutation than expected by expert biologists. This problem leads to a non-linear probabilistic model and thus goes beyond the scope of the first part. In this second part we therefore consider generalized linear models with individual intercepts added to the linear predictor, and explore the statistical properties of a new procedure for simultaneous estimation of the regression coefficients and intercepts, again using the sorted-l1 penalization. In this part we focus only on the low-dimensional case and are again interested in the performance of our procedure in terms of statistical estimation error and FDR.
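
For readers unfamiliar with the sorted-l1 (SLOPE) penalization named in the abstract, the following is a minimal sketch of how such a penalty is evaluated, assuming the usual definition: the largest absolute coefficient is multiplied by the largest weight, the second largest by the second weight, and so on. The function name `sorted_l1_norm` and the example weights are illustrative assumptions and are not taken from the thesis.

```python
def sorted_l1_norm(coefs, weights):
    """Evaluate a sorted-l1 penalty: sum_i weights[i] * |coefs|_(i).

    |coefs|_(1) >= |coefs|_(2) >= ... are the absolute values of the
    coefficients in decreasing order. The weights are assumed to be
    non-negative and non-increasing (an assumption of this sketch).
    """
    if len(coefs) != len(weights):
        raise ValueError("coefs and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if any(a < b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be non-increasing")

    # Sort absolute values from largest to smallest.
    abs_sorted = sorted((abs(c) for c in coefs), reverse=True)

    # Pair the k-th largest magnitude with the k-th weight.
    return sum(w * a for w, a in zip(weights, abs_sorted))


# Hypothetical example values, chosen only for illustration.
example_coefs = [1.0, -3.0, 2.0]
example_weights = [3.0, 2.0, 1.0]
penalty = sorted_l1_norm(example_coefs, example_weights)
```

With constant weights this quantity reduces to a multiple of the ordinary l1 norm, whereas decreasing weights penalize the largest entries more heavily.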
https://tel.archives-ouvertes.fr/tel-02976485
Contributor : Abes Star
Submitted on : Friday, October 23, 2020 - 2:35:12 PM
Last modification on : Tuesday, October 27, 2020 - 4:34:06 PM

File

73717_VIROULEAU_2020_archivage...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02976485, version 1

Citation

Alain Virouleau. Machine Learning and Big Data for outlier detection, and applications. Statistics [math.ST]. Institut Polytechnique de Paris, 2020. English. ⟨NNT : 2020IPPAX028⟩. ⟨tel-02976485⟩
