Skip to Main content Skip to Navigation

Toward Scalable Hierarchical Clustering and Co-clustering Methods : application to the Cluster Hypothesis in Information Retrieval

Abstract : As a major type of unsupervised machine learning method, clustering has been widely applied in various tasks. Different clustering methods have different characteristics. Hierarchical clustering, for example, is capable to output a binary tree-like structure, which explicitly illustrates the interconnections among data instances. Co-clustering, on the other hand, generates co-clusters, each containing a subset of data instances and a subset of data attributes. Applying clustering on textual data enables to organize input documents and reveal connections among documents. This characteristic is helpful in many cases, for example, in cluster-based Information Retrieval tasks. As the size of available data increases, demand of computing power increases. In response to this demand, many distributed computing platforms are developed. These platforms use the collective computing powers of commodity machines to parallelize data, assign computing tasks and perform computation concurrently.In this thesis, we first address text clustering tasks by proposing two clustering methods, Sim_AHC and SHCoClust. They respectively represent a similarity-based hierarchical clustering and a similarity-based hierarchical co-clustering. We examine their properties and performances through mathematical deduction, experimental verification and evaluation. Then we apply these methods in testing the cluster hypothesis, which is the fundamental assumption in cluster-based Information Retrieval. In such tests, we apply the optimal cluster search to evaluation the retrieval effectiveness of different clustering methods. We examine the computing efficiency and compare the results of the proposed tests. In order to perform clustering on larger datasets, we select Apache Spark platform and provide distributed implementation of Sim_AHC and of SHCoClust. For distributed Sim_AHC, we present the designed computing procedure, illustrate confronted difficulties and provide possible solutions. And for SHCoClust, we provide a distributed implementation of its core, spectral embedding. In this implementation, we use several datasets that vary in size to examine scalability.
Document type :
Complete list of metadatas

Cited literature [127 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Friday, September 20, 2019 - 4:03:08 PM
Last modification on : Thursday, April 30, 2020 - 10:12:03 AM
Long-term archiving on: : Sunday, February 9, 2020 - 1:54:35 AM


  • HAL Id : tel-02293176, version 1



Xinyu Wang. Toward Scalable Hierarchical Clustering and Co-clustering Methods : application to the Cluster Hypothesis in Information Retrieval. Technology for Human Learning. Université de Lyon, 2017. English. ⟨NNT : 2017LYSE2123⟩. ⟨tel-02293176⟩



Record views


Files downloads