Theses

Model-based clustering for categorical and mixed data sets

Matthieu Marbac-Lourdelle 1
1 MODAL - MOdel for Data Analysis and Learning
Inria Lille - Nord Europe, LPP - Laboratoire Paul Painlevé - UMR 8524, CERIM - Santé publique : épidémiologie et qualité des soins-EA 2694, Polytech Lille - École polytechnique universitaire de Lille, Université de Lille, Sciences et Technologies
Abstract : This work contributes to the cluster analysis of categorical and mixed data. The methods proposed in this manuscript model the data distribution in a probabilistic framework. When the data are categorical or mixed, the classical model assumes that the variables are independent conditionally on the class. However, this approach is biased when the variables are correlated within classes. The aim of this thesis is to study and present mixture models that relax the conditional independence assumption while summarizing each class with a few characteristic parameters. The first part of this manuscript is devoted to the cluster analysis of categorical data. Categorical variables are difficult to cluster because they confront the statistician with many combinatorial challenges. In this context, our contribution consists of two parsimonious mixture models for clustering categorical data that present intra-class dependencies. The main idea of these models is to group the variables into conditionally independent blocks. By setting specific distributions for these blocks, both models account for the intra-class dependencies between the variables. The first approach models the block distribution as a mixture of two extreme dependency distributions, while the second models it as a multinomial distribution per modes. The second objective of this work is the cluster analysis of mixed data sets. The challenge stems from the lack of classical distributions for mixed variables. We therefore define a probabilistic model that respects two main constraints. First, the one-dimensional marginal distributions of the components are classical for each variable. Second, the model characterizes the main intra-class dependencies. This model is defined as a mixture of Gaussian copulas. Bayesian inference is performed via a Gibbs sampler. The classical information criteria (BIC, ICL) are used for model selection.
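
As an illustration of the classical model mentioned in the abstract (variables independent conditionally on the class), the following minimal sketch evaluates the log-likelihood of such a mixture for categorical data and a BIC value. It is not taken from the thesis: the function names, the toy data, and the parameter values are hypothetical, and the BIC is written in the "larger is better" convention, which is an assumption about the convention used.

```python
import math


def lcm_log_likelihood(data, proportions, alphas):
    """Log-likelihood of a mixture with conditionally independent categorical variables.

    data        : list of observations, each a list of level indices (one per variable)
    proportions : class proportions, summing to 1
    alphas      : alphas[k][j][h] = P(variable j takes level h | class k)
    """
    total = 0.0
    for x in data:
        density = 0.0
        for pi_k, alpha_k in zip(proportions, alphas):
            # Within a class, the joint probability is a product over variables.
            within = pi_k
            for j, level in enumerate(x):
                within *= alpha_k[j][level]
            density += within
        total += math.log(density)
    return total


def lcm_num_free_parameters(n_classes, n_levels):
    """Free parameters: (g - 1) proportions plus g * sum_j (m_j - 1) level probabilities."""
    return (n_classes - 1) + n_classes * sum(m - 1 for m in n_levels)


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'larger is better' convention (assumed here)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


# Hypothetical toy example: two binary variables, two classes.
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
proportions = [0.5, 0.5]
alphas = [
    [[0.9, 0.1], [0.9, 0.1]],  # class 1 favours level 0 on both variables
    [[0.1, 0.9], [0.1, 0.9]],  # class 2 favours level 1 on both variables
]

loglik = lcm_log_likelihood(data, proportions, alphas)
n_params = lcm_num_free_parameters(len(proportions), [2, 2])
print(loglik, n_params, bic(loglik, n_params, len(data)))
```

The parameter count shows why parsimony matters: under conditional independence it grows linearly with the number of variables, whereas an unconstrained joint distribution per class would grow with the product of the numbers of levels.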

https://tel.archives-ouvertes.fr/tel-01076418
Contributor : Matthieu Marbac
Submitted on : Wednesday, October 22, 2014 - 10:05:43 AM
Last modification on : Thursday, February 21, 2019 - 10:34:08 AM
Archived on : Friday, January 23, 2015 - 10:20:13 AM

Identifiers

  • HAL Id : tel-01076418, version 1

Citation

Matthieu Marbac-Lourdelle. Model-based clustering for categorical and mixed data sets. Statistics [math.ST]. Université Lille 1, 2014. English. ⟨tel-01076418⟩

Metrics

Record views

886

Files downloads

997