
Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data

Abstract : Cluster analysis, or clustering, which aims to group similar objects together, is a powerful unsupervised learning technique. With the growing amount of available data, clustering is gaining importance in many areas of data science, for purposes such as automatic summarization, dimensionality reduction, visualization, outlier detection, speeding up search engines, and organizing huge data sets. Existing clustering approaches are, however, severely challenged by the high dimensionality and extreme sparsity of the data sets arising in some current areas of interest, such as Collaborative Filtering (CF) and text mining. Such data often consist of thousands of features and more than 95% zero entries. In addition to being high dimensional and sparse, the data sets encountered in these domains are also directional in nature. Indeed, several previous studies have empirically demonstrated that directional measures, which assess the similarity between objects by the angle between them (for example, the cosine similarity), are substantially superior to other measures, such as Euclidean distortions, for clustering text documents or assessing the similarities between users/items in CF. This suggests that in such contexts only the direction of a data vector (e.g., a text document) is relevant, not its magnitude. It is worth noting that the cosine similarity is exactly the scalar product between unit-length data vectors, i.e., L2-normalized vectors. Thus, from a probabilistic perspective, using the cosine similarity is equivalent to assuming that the data are directional data distributed on the surface of a unit hypersphere.
Despite the substantial empirical evidence that certain high-dimensional sparse data sets, such as those encountered in the above domains, are better modeled as directional data, most existing models in text mining and CF rely on popular distributional assumptions, such as Gaussian, Multinomial, or Bernoulli, which are inadequate for L2-normalized data. In this thesis, we focus on two challenging tasks, text document clustering and item recommendation, which still attract considerable attention in text mining and CF, respectively. To address the above limitations, we propose a suite of new models and algorithms that rely on the von Mises-Fisher (vMF) assumption, which arises naturally for directional data lying on a unit hypersphere.
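The abstract's remark that cosine similarity is exactly the scalar product of L2-normalized vectors can be checked with a minimal sketch. The term-count vectors below are hypothetical illustrations and are not data from the thesis; the helper names are likewise assumptions made for this example.

```python
import math

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit Euclidean length (assumes v is not all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical sparse term-count vectors (illustrative only).
doc_a = [3, 0, 0, 1, 0, 2]
doc_b = [30, 0, 0, 10, 0, 20]  # same direction as doc_a, ten times the magnitude
doc_c = [0, 4, 1, 0, 0, 0]     # shares no terms with doc_a

cos_ab = cosine_similarity(doc_a, doc_b)
dot_ab = dot(l2_normalize(doc_a), l2_normalize(doc_b))
cos_ac = cosine_similarity(doc_a, doc_c)

# Euclidean distance, by contrast, is sensitive to magnitude.
euclid_ab = math.sqrt(sum((a - b) ** 2 for a, b in zip(doc_a, doc_b)))
```

Here `doc_a` and `doc_b` point in the same direction, so their cosine similarity matches the scalar product of their normalized versions even though their Euclidean distance is large, which illustrates why only direction matters under this measure.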

Cited literature : 188 references
Contributor : ABES STAR
Submitted on : Wednesday, July 11, 2018 - 4:00:06 PM
Last modification on : Saturday, June 19, 2021 - 3:49:27 AM
Long-term archiving on : Saturday, October 13, 2018 - 8:21:27 AM


Version validated by the jury (STAR)


  • HAL Id : tel-01835699, version 1


Aghiles Salah. Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data. Information Retrieval [cs.IR]. Université Sorbonne Paris Cité, 2016. English. ⟨NNT : 2016USPCB093⟩. ⟨tel-01835699⟩


