
Sketching for large-scale learning of mixture models

Abstract: Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. Furthermore, modern database architectures raise new challenges, such as the requirement that learning methods be amenable to streaming, parallel, and distributed computing. In this context, an increasingly popular approach is to first compress the database into a representation called a linear sketch, which satisfies all of these requirements, and then learn the desired information using only this sketch, which can be significantly faster than using the full data when the sketch is small. In this thesis, we introduce a generic methodology to fit a mixture of probability distributions to the data using only a sketch of the database. The sketch is defined by combining two notions from the reproducing kernel literature, namely kernel mean embeddings and random feature expansions. It corresponds to linear measurements of the underlying probability distribution of the data, and the estimation problem is thus analyzed through the lens of compressive sensing (CS), in which a (traditionally finite-dimensional) signal is randomly measured and recovered. We extend CS results to our infinite-dimensional framework, give generic conditions for successful estimation, and apply this analysis to many problems, with a focus on mixture model estimation. We base our method on the construction of random sketching operators such that a Restricted Isometry Property (RIP) condition holds with high probability in the Banach space of finite signed measures. In the second part, we introduce a flexible heuristic greedy algorithm to estimate mixture models from a sketch.
We apply it to synthetic and real data on three problems: the estimation of centroids from a sketch, where it is significantly faster than k-means; Gaussian mixture model estimation, where it is more efficient than Expectation-Maximization; and the estimation of mixtures of multivariate stable distributions, for which, to our knowledge, it is the only algorithm capable of performing such a task.
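To make the sketching idea above concrete, here is a minimal illustrative example (not the thesis's actual implementation): the sketch is computed as the empirical average of random Fourier features, a standard random feature expansion associated with the Gaussian kernel mean embedding. The function and variable names, the Gaussian frequency distribution, and the toy data are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def compute_sketch(X, Omega):
    """Empirical sketch: average of random Fourier features exp(i * Omega^T x).

    X     : (n, d) data matrix
    Omega : (d, m) matrix of random frequencies (here drawn i.i.d. Gaussian)
    Returns an m-dimensional complex vector, a linear sketch of the
    empirical probability distribution of the data.
    """
    Z = np.exp(1j * X @ Omega)  # (n, m) complex random features
    return Z.mean(axis=0)       # average over samples -> fixed-size sketch

# Toy data: a balanced mixture of two Gaussians in d = 2 dimensions.
d, m, n = 2, 50, 10000
X = np.concatenate([
    rng.normal(-2.0, 0.5, size=(n // 2, d)),
    rng.normal(+2.0, 0.5, size=(n // 2, d)),
])
Omega = rng.normal(0.0, 1.0, size=(d, m))  # random frequency matrix

z = compute_sketch(X, Omega)
print(z.shape)  # (50,): sketch size is independent of the number of samples n
```

Because the sketch is an average, sketches of disjoint chunks of the data can be combined by a weighted mean, which is what makes the representation amenable to streaming and distributed computing, as described in the abstract.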

Cited literature: 207 references

Contributor: Abes Star
Submitted on : Thursday, January 18, 2018 - 5:15:07 PM
Last modification on : Friday, April 8, 2022 - 4:04:02 PM
Long-term archiving on: Thursday, May 24, 2018 - 1:49:47 AM


Version validated by the jury (STAR)


  • HAL Id: tel-01620815, version 2


Nicolas Keriven. Sketching for large-scale learning of mixture models. Machine Learning [stat.ML]. Université Rennes 1, 2017. English. ⟨NNT : 2017REN1S055⟩. ⟨tel-01620815v2⟩


