A regularized approach of instances x variables co-clustering for exploratory data analysis

Abstract : Co-clustering is a class of unsupervised data analysis techniques that extract the underlying dependency structure between the rows and columns of a data table in the form of homogeneous blocks, known as co-clusters. These techniques fall into two groups: those that simultaneously cluster the instances and the variables, and those that cluster the values of two or more variables of a data set. Most of them are limited to variables of the same type, and few scale to large data sets while still producing easily interpretable clusters and co-clusters. Among the existing value-based co-clustering approaches, MODL is suitable for processing large data sets with several numerical or categorical variables. In this thesis, we propose a value-based approach, inspired by MODL, that simultaneously clusters the instances and the variables of a data set with potentially mixed-type variables. The proposed co-clustering model provides a Maximum A Posteriori (MAP) based summary of the data that can be used as is for exploratory analysis. When the summary is large, exploratory analysis tools such as model coarsening can simplify the co-clustering, which makes the results easier to interpret. We show that the proposed approach can handle large data sets and extract easily interpretable clusters from mixed data with more than 10 million observations. We also show the robustness of the approach, its capacity to extract inter-dependence between the variables, and its good behavior in extreme cases, such as pattern-less data and perfectly correlated variables.
Cited literature : 122 references

https://hal.archives-ouvertes.fr/tel-01979698
Contributor : Aichetou Bouchareb
Submitted on : Sunday, January 13, 2019 - 6:52:48 PM
Last modification on : Wednesday, January 23, 2019 - 1:17:12 AM
Long-term archiving on: Sunday, April 14, 2019 - 12:58:25 PM

File

Manuscript_Thèse_Aichetou_Bou...
Files produced by the author(s)

Identifiers

  • HAL Id : tel-01979698, version 1

Citation

Aichetou Bouchareb. A regularized approach of instances x variables co-clustering for exploratory data analysis. Mathematics [math]. Université Paris 1 Panthéon-La Sorbonne, 2018. English. ⟨tel-01979698⟩
