Sparse and discriminative clustering for complex data. An application to cytology.

Camille Brunet 1
1 SIBI
IBISC - Informatique, Biologie Intégrative et Systèmes Complexes
Abstract : The main topics of this manuscript are sparsity and discrimination for modeling complex data. In the first part, we focus on the Gaussian mixture model (GMM) context: we introduce a new family of probabilistic models that simultaneously clusters the data and finds a discriminative subspace chosen to best separate the groups. This family of 12 Discriminative Latent Mixture (DLM) models rests on three ideas: first, the actual data live in a latent subspace whose intrinsic dimension is lower than that of the observed space; second, a subspace of K-1 dimensions is theoretically sufficient to discriminate K groups; third, the observation space and the latent space are linked by a linear transformation. An estimation procedure, named Fisher-EM, is proposed and improves clustering performance most of the time, owing to the use of a discriminative subspace. Because each axis spanning the discriminative subspace is a linear combination of all original variables, we propose three methods based on a penalized criterion to ease the interpretation of the results. In particular, this makes it possible to introduce sparsity directly in the loadings of the projection matrix, which also enables variable selection for clustering. In the second part, we focus on the seriation context. We propose a dissimilarity measure based on a common neighborhood, which makes it possible to deal with noisy data and overlapping groups. A forward stepwise seriation algorithm, called the PB-Clus algorithm, is introduced and yields a block representation of the data. This tool reveals the intrinsic structure of the data even in the presence of noise, outliers, and overlapping or non-Gaussian groups. Both methods have been validated on a biological application: cancer cell detection.
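The second idea in the abstract (K-1 dimensions suffice to discriminate K groups) can be illustrated with the classical Fisher criterion on labeled data. The sketch below is not the Fisher-EM procedure of the thesis, which estimates the subspace jointly with the unknown cluster memberships; it only shows why the discriminative subspace has at most K-1 useful axes. The function name `fisher_subspace`, the `ridge` parameter, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def fisher_subspace(X, labels, ridge=1e-6):
    """Classical Fisher discriminant subspace (illustrative sketch).

    Returns an orthonormal basis U of shape (p, K-1) and the eigenvalues
    of the whitened between-class scatter, sorted in decreasing order.
    """
    classes = np.unique(labels)
    p = X.shape[1]
    mean = X.mean(axis=0)
    S_w = np.zeros((p, p))  # within-class scatter
    S_b = np.zeros((p, p))  # between-class scatter
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        S_b += len(Xc) * (d @ d.T)
    S_w += ridge * np.eye(p)  # small ridge keeps S_w positive definite

    # Whiten with the Cholesky factor so the eigenproblem is symmetric:
    # M = L^{-1} S_b L^{-T}, where S_w = L L^T.
    L = np.linalg.cholesky(S_w)
    A = np.linalg.solve(L, S_b)
    M = np.linalg.solve(L, A.T)
    M = (M + M.T) / 2.0
    evals, V = np.linalg.eigh(M)           # ascending order
    evals, V = evals[::-1], V[:, ::-1]     # switch to descending order

    # Map the leading K-1 eigenvectors back and orthonormalize them.
    W = np.linalg.solve(L.T, V[:, : len(classes) - 1])
    U, _ = np.linalg.qr(W)
    return U, evals

# Simulated data (an assumption): 3 Gaussian groups in 10 dimensions.
rng = np.random.default_rng(0)
centers = np.zeros((3, 10))
centers[0, 0] = 5.0
centers[1, 1] = 5.0
centers[2, :2] = -5.0
X = np.vstack([c + rng.normal(size=(50, 10)) for c in centers])
labels = np.repeat(np.arange(3), 50)

U, evals = fisher_subspace(X, labels)
Z = X @ U  # data projected onto the K-1 = 2 discriminative axes
```

Because the between-class scatter is built from K group means that sum (with weights) to the global mean, its rank is at most K-1, so only the first K-1 eigenvalues are non-negligible.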
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-00671333
Contributor : Camille Brunet
Submitted on : Friday, February 17, 2012 - 11:21:49 AM
Last modification on : Wednesday, January 23, 2019 - 1:48:04 PM
Document(s) archived on : Thursday, November 22, 2012 - 1:00:08 PM

Identifiers

  • HAL Id : tel-00671333, version 1

Citation

Camille Brunet. Sparse and discriminative clustering for complex data. An application to cytology. Applications [stat.AP]. Université d'Evry-Val d'Essonne, 2011. English. ⟨tel-00671333⟩

Metrics

Record views

613

File downloads

463