High-dimensional vector quantization : convergence rates and variable selection

Abstract : This thesis first studies the distortion of the quantizer built from an n-sample of a probability distribution over a vector space with the well-known k-means algorithm. More precisely, it aims to give oracle inequalities on the difference between the distortion of the k-means quantizer and the minimum distortion achievable by a k-point quantizer, in which the influence of the natural parameters of the quantization problem is precisely described. These parameters include the support of the distribution, the size k of the set of image points of the quantizer, the dimension of the underlying Euclidean space, and the sample size n. After a brief summary of previous work on this topic, it is shown that, in the continuous density case, the conditions previously stated for the excess distortion to decrease fast with respect to the sample size are equivalent to a technical condition. Interestingly, this condition resembles a technical condition required in statistical learning to achieve fast rates of convergence. It is then proved that the excess distortion achieves a fast convergence rate of 1/n in expectation, provided that this technical condition is satisfied. Next, a so-called margin condition, which is easier to interpret, is introduced, and it is established that this margin condition implies the technical condition mentioned above. Some examples of distributions satisfying this margin condition are presented, such as Gaussian mixtures, which are classical distributions in the clustering framework. Then, provided that this margin condition is satisfied, an oracle inequality on the excess distortion of the k-means quantizer is given. This convergence result shows that the excess distortion decreases at a rate of 1/n and depends on natural geometric properties of the probability distribution with respect to the size k of the set of image points.
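For reference, the quantities discussed above can be written as follows. This is only a sketch: the notation (the codebook c = (c_1, ..., c_k), the distortion R, its empirical version, and the optimal distortion R*) is assumed here for illustration and is not taken from the thesis, and the constant C is left unspecified.

```latex
% Assumed notation: X ~ P, X_1, ..., X_n an n-sample of P,
% c = (c_1, ..., c_k) the image points of a k-point quantizer.
\begin{align*}
  R(\mathbf{c}) &= \mathbb{E}\Big[\min_{1 \le j \le k} \|X - c_j\|^2\Big],
  &
  \hat{R}_n(\mathbf{c}) &= \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \|X_i - c_j\|^2, \\
  \hat{\mathbf{c}}_n &\in \operatorname*{arg\,min}_{\mathbf{c}} \hat{R}_n(\mathbf{c}),
  &
  R^* &= \inf_{\mathbf{c}} R(\mathbf{c}).
\end{align*}
% Fast rate of the excess distortion, under the technical (or margin) condition:
\[
  \mathbb{E}\big[R(\hat{\mathbf{c}}_n)\big] - R^* \le \frac{C}{n}.
\]
```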
Surprisingly, the dimension of the underlying Euclidean space seems to play no role in the convergence rate of the distortion. As a consequence, the results extend directly to the case where the underlying space is a Hilbert space, which is the appropriate framework for curve quantization. However, high-dimensional quantization often requires, in practice, a dimension reduction step before a quantization algorithm is applied. This motivates the study of a variable selection procedure adapted to the quantization problem. More precisely, a Lasso-type procedure adapted to the quantization framework is studied. The Lasso-type penalty applies to the set of image points of the quantizer, in order to obtain sparse image points. The outcome of this procedure is called the Lasso k-means quantizer, and some theoretical results on this quantizer are established under the margin condition introduced above. First, it is proved that the image points of such a quantizer are close to the image points of a sparse quantizer, achieving a kind of tradeoff between excess distortion and size of the support of the image points. Then an oracle inequality on the excess distortion of the Lasso k-means quantizer is given, providing a convergence rate of 1/n^(1/2) in expectation. Moreover, the dependency of this convergence rate on the other parameters is precisely described. These theoretical predictions are illustrated with numerical experiments, showing that the Lasso k-means procedure mostly behaves as expected. However, the numerical experiments also shed light on some drawbacks concerning the practical implementation of such an algorithm.
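To make the idea of a Lasso-type penalty on the image points concrete, here is a minimal Python sketch. It assumes a plain l1 penalty on the coordinates of the image points and a Lloyd-style alternating scheme; the exact penalty and algorithm used in the thesis may differ. The names `soft_threshold`, `lasso_kmeans`, `lam`, `n_iter` and `seed` are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Coordinate-wise soft-thresholding: shrink v toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_kmeans(X, k, lam, n_iter=50, seed=0):
    """Alternating minimisation of the penalised empirical distortion

        (1/n) * sum_i min_j ||x_i - c_j||^2 + lam * sum_j ||c_j||_1.

    The plain l1 penalty is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialise the image points with k distinct sample points.
    centers = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest image point in squared Euclidean distance.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: soft-thresholded cell means.
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue  # empty cell: keep the previous image point
            # Per coordinate, minimising (n_j/n)*(c - mean)^2 + lam*|c|
            # gives a soft threshold at lam * n / (2 * n_j).
            threshold = lam * n / (2.0 * len(members))
            centers[j] = soft_threshold(members.mean(axis=0), threshold)
    return centers, labels


# Example: two groups that differ only in the first two of ten coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X[:100, :2] += 3.0
X[100:, :2] -= 3.0
centers, labels = lasso_kmeans(X, k=2, lam=0.5)
```

With `lam=0` the update reduces to the usual cell means, so the sketch falls back to the standard Lloyd iteration; larger values of `lam` tend to set more coordinates of the image points exactly to zero.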
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-01126851
Contributor : Abes Star
Submitted on : Friday, March 6, 2015 - 10:15:03 PM
Last modification on : Friday, May 17, 2019 - 10:48:43 AM
Long-term archiving on : Sunday, June 7, 2015 - 5:55:20 PM

Identifiers

  • HAL Id : tel-01126851, version 1

Citation

Clément Levrard. High-dimensional vector quantization : convergence rates and variable selection. Statistics [math.ST]. Université Paris Sud - Paris XI, 2014. English. ⟨NNT : 2014PA112214⟩. ⟨tel-01126851⟩
