Theses

Stability and selection of the number of groups in unsupervised clustering : application to the classification of triple negative breast cancers

Abstract : In this thesis, I address the classification of Triple Negative Breast Cancer (TNBC) tumors from a statistical point of view. After proposing a protein-based classification of TNBC, I focus mainly on the use of cluster stability for selecting the number of groups in unsupervised clustering, since this is the method generally employed when classifying TNBC. The aim of this method is to obtain a classification that is robust, that is, easily replicable on similar data. Robustness is measured by the sensitivity of the classification to small changes, such as subsampling of the dataset. Despite the popularity of this method, little is known about how or when it works. For this reason, I propose two methodological contributions that increase the usability and interpretability of this method: (1) An R package, clustRstab, that makes it easy to estimate the stability of a clustering under different parameter settings. This package is accompanied by a simulation study and an application study investigating when and how the method works. (2) A modified version of the Adjusted Rand Index (ARI), a popular score for comparing clusterings; such comparison is a crucial step in estimating the stability of a clustering. I correct this score by basing it on a multinomial distribution hypothesis, which enables it to take into account dependence between clusterings and to support statistical inference. This Modified ARI (MARI) is implemented in the R package aricode. These two methods are then applied to a large cohort of TNBC tumors, and the results are discussed in relation to earlier classification results for TNBC.
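
The stability idea described in the abstract can be illustrated with a minimal sketch. It is not the clustRstab or aricode implementation: it is written in Python rather than R, it uses the standard (Hubert-Arabie) ARI rather than the modified score proposed in the thesis, and the names adjusted_rand_index, stability and gap_clustering, as well as the toy data, are hypothetical and chosen only for illustration. The sketch clusters pairs of random subsamples and averages the ARI computed on the items the two subsamples share.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Standard (Hubert-Arabie) ARI between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Number of item pairs placed together in both labelings, in A, and in B.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs  # expected agreement by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (pairs_both - expected) / (max_index - expected)


def gap_clustering(values, k):
    """Toy 1-D clustering (an assumption for this demo): cut at the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cuts = set(sorted(range(1, len(order)),
                      key=lambda j: values[order[j]] - values[order[j - 1]],
                      reverse=True)[:k - 1])
    labels = [0] * len(values)
    current = 0
    for rank, i in enumerate(order):
        if rank in cuts:
            current += 1
        labels[i] = current
    return labels


def stability(data, cluster_fn, k, n_pairs=20, fraction=0.8, seed=0):
    """Mean ARI between clusterings of subsample pairs, compared on shared items."""
    rng = random.Random(seed)
    size = max(2, int(fraction * len(data)))
    scores = []
    for _ in range(n_pairs):
        idx_1 = rng.sample(range(len(data)), size)
        idx_2 = rng.sample(range(len(data)), size)
        labels_1 = dict(zip(idx_1, cluster_fn([data[i] for i in idx_1], k)))
        labels_2 = dict(zip(idx_2, cluster_fn([data[i] for i in idx_2], k)))
        shared = sorted(set(idx_1) & set(idx_2))
        scores.append(adjusted_rand_index([labels_1[i] for i in shared],
                                          [labels_2[i] for i in shared]))
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of ten points each.
toy_data = [0.1 * i for i in range(10)] + [10 + 0.1 * i for i in range(10)]
for k in (2, 3, 4):
    print(k, round(stability(toy_data, gap_clustering, k), 3))
```

Under this scheme, the number of groups would be chosen among the values of k with the highest stability score. Any clustering routine with the same (data, k) signature could replace the toy gap_clustering function.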

https://tel.archives-ouvertes.fr/tel-03164674
Contributor : ABES STAR
Submitted on : Wednesday, March 10, 2021 - 10:13:10 AM
Last modification on : Tuesday, September 13, 2022 - 2:14:30 PM

File

92420_SUNDQVIST_2020_archivage...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-03164674, version 1

Citation

Martina Sundqvist. Stability and selection of the number of groups in unsupervised clustering : application to the classification of triple negative breast cancers. Cancer. Université Paris-Saclay, 2020. English. ⟨NNT : 2020UPASM026⟩. ⟨tel-03164674⟩
