
Model-based clustering for categorical and mixed data sets

Matthieu Marbac-Lourdelle 1
1 MODAL - MOdel for Data Analysis and Learning
LPP - Laboratoire Paul Painlevé - UMR 8524, Université de Lille, Sciences et Technologies, Inria Lille - Nord Europe, METRICS - Evaluation des technologies de santé et des pratiques médicales - ULR 2694, Polytech Lille - École polytechnique universitaire de Lille
Abstract : This work is our contribution to the cluster analysis of categorical and mixed data. The methods proposed in this manuscript model the data distribution in a probabilistic framework. When the data are categorical or mixed, the classical model assumes that the variables are independent conditionally on the class. However, this approach is biased when the variables are correlated within classes. The aim of this thesis is to study and present mixture models that relax the conditional independence assumption while summarizing each class with a few characteristic parameters. The first part of this manuscript is devoted to the cluster analysis of categorical data. Categorical variables are difficult to cluster because they confront the statistician with many combinatorial challenges. In this context, our contribution consists of two parsimonious mixture models that can cluster categorical data with intra-class dependencies. The main idea of these models is to group the variables into conditionally independent blocks. By setting specific distributions for these blocks, both models account for the intra-class dependencies between the variables. The first approach models the block distribution as a mixture of two extreme dependency distributions, while the second models it as a multinomial distribution per modes. The study of the cluster analysis of mixed data sets is the second objective of this work. The challenge stems from the lack of classical distributions for mixed variables. We therefore define a probabilistic model that respects two main constraints. First, the one-dimensional marginal distributions of the components are classical for each variable. Second, the model characterizes the main intra-class dependencies. This model is defined as a mixture of Gaussian copulas. Bayesian inference is performed via a Gibbs sampler. The classical information criteria (BIC, ICL) are used to perform model selection.
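
For reference, the classical conditional independence model mentioned in the abstract (the latent class model for categorical data) and the BIC criterion can be sketched as follows. This is a standard textbook formulation, not one taken from the thesis. The notation ($g$ classes, $d$ variables, $m_j$ levels for variable $j$, class proportions $\pi_k$, level probabilities $\alpha_{kj}^{h}$, one-hot coding $x_i^{jh}$) is an assumption introduced here for illustration, and the BIC is shown in one common sign convention.

```latex
% Latent class model: variables are independent conditionally on the class k.
% Notation is assumed for illustration and may differ from the thesis.
p(\boldsymbol{x}_i ; \boldsymbol{\theta})
  = \sum_{k=1}^{g} \pi_k \prod_{j=1}^{d} \prod_{h=1}^{m_j}
    \left( \alpha_{kj}^{h} \right)^{x_i^{jh}},
\qquad \sum_{k=1}^{g} \pi_k = 1,
\quad \sum_{h=1}^{m_j} \alpha_{kj}^{h} = 1 .

% BIC for a model with \nu free parameters fitted on n observations
% (larger is better in this convention).
\mathrm{BIC} = \ln L(\hat{\boldsymbol{\theta}}) - \frac{\nu}{2} \ln n,
\qquad \nu = (g - 1) + g \sum_{j=1}^{d} (m_j - 1).
```

Under this model the parameter count $\nu$ grows linearly with the number of variables, whereas an unconstrained multinomial distribution over all level combinations would need $\prod_{j} m_j - 1$ parameters per class.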

Contributor : Matthieu Marbac
Submitted on : Wednesday, October 22, 2014 - 10:05:43 AM
Last modification on : Friday, November 27, 2020 - 2:18:02 PM
Long-term archiving on : Friday, January 23, 2015 - 10:20:13 AM


  • HAL Id : tel-01076418, version 1



Matthieu Marbac-Lourdelle. Model-based clustering for categorical and mixed data sets. Statistics [math.ST]. Université Lille 1, 2014. English. ⟨tel-01076418⟩


