Statistical learning with high-cardinality string categorical variables

Abstract : Tabular data often contain columns with categorical variables, usually considered as non-numerical entries with a fixed and limited number of unique elements or categories. As many statistical learning algorithms require numerical representations of features, an encoding step is necessary to transform categorical entries into feature vectors, using for instance one-hot encoding. This and similar strategies work well, in terms of prediction performance and interpretability, in standard statistical analysis when the number of categories is small. However, non-curated data give rise to string categorical variables with very high cardinality and redundancy: the string entries share semantic and/or morphological information, and several entries can reflect the same entity. Without any data cleaning or feature engineering step, common encoding methods break down, as they tend to lose information in their vectorial representation. They can also create high-dimensional feature vectors, which prevents their use in large-scale settings. In this work, we study a series of categorical encodings that remove the need for preprocessing steps on high-cardinality string categorical variables. An ideal encoder should be scalable to many categories, be interpretable to end users, and capture the morphological information contained in the string entries. Experiments on real and simulated data show that the methods we propose improve supervised learning, are adapted to large-scale settings, and, in some cases, create feature vectors that are easily interpretable. Hence, they can be applied in Automated Machine Learning (AutoML) pipelines to the original string entries without any human intervention.
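As a rough illustration of the problem described in the abstract, the sketch below contrasts one-hot encoding with an encoding based on character n-gram overlap. It is not taken from the thesis: the function names, the example job titles, the choice of 3-grams, and the use of Jaccard similarity are all assumptions made for this example.

```python
def char_ngrams(entry, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {entry.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to yield n-grams are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def one_hot(entry, categories):
    """One-hot vector: 1 only where the entry matches a category exactly."""
    return [1 if entry == category else 0 for category in categories]


def similarity_encode(entry, categories, n=3):
    """Hypothetical encoder: similarity of the entry to each reference category."""
    return [ngram_similarity(entry, category, n) for category in categories]


# Illustrative, made-up categories with morphological overlap.
categories = ["police officer", "police officer ii", "fire fighter"]

# One-hot encoding sees the first two categories as unrelated.
print(one_hot("police officer", categories))

# The n-gram encoding gives a high score to the near-duplicate category.
print(similarity_encode("police officer", categories))
```

In this sketch the one-hot vectors of "police officer" and "police officer ii" share no nonzero coordinate, whereas the n-gram vector assigns the near-duplicate a much higher score than the unrelated "fire fighter". Note that the output dimension still equals the number of reference categories, so this alone does not address the scalability concern raised in the abstract.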

Cited literature [141 references]
Contributor : Abes Star
Submitted on : Wednesday, May 20, 2020 - 7:09:08 PM
Last modification on : Tuesday, May 26, 2020 - 9:12:09 AM


Version validated by the jury (STAR)


  • HAL Id : tel-02614322, version 1



Patricio Cerda Reyes. Statistical learning with high-cardinality string categorical variables. Machine Learning [cs.LG]. Université Paris-Saclay, 2019. English. ⟨NNT : 2019SACLS470⟩. ⟨tel-02614322⟩


