
A New Co-similarity Measure : Application to Text Mining and Bioinformatics

Abstract : Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts, and a multitude of clustering algorithms exist for different settings. As datasets become larger and more varied, existing algorithms must be adapted to maintain the quality of clusters. In this regard, high-dimensional data poses problems for traditional clustering algorithms, known as the 'curse of dimensionality'. This thesis proposes a co-similarity algorithm based on the concept of distributional semantics, using higher-order co-occurrences extracted from the given data. As opposed to co-clustering, where both the instance and feature sets are hard clustered, co-similarity may be defined as a 'softer' approach. The output of the algorithm is two similarity matrices: one for the objects and one for their features. Each of these similarity matrices exploits the similarity of the other, thereby implicitly taking advantage of a co-clustering style approach. Hence, with our method, it becomes possible to use any classical clustering method (k-means, hierarchical clustering, etc.) to co-cluster data. We explore two applications of our co-similarity measure. In text mining, document similarity is calculated from word similarity, which in turn is calculated from document similarity. In this way, we capture not only the similarity between documents arising from their common words but also the similarity arising from words that are not directly shared by the two documents but can be considered similar. The second application is to gene expression datasets and is an example of co-clustering. We use our proposed method to extract gene clusters that show similar expression levels under a given condition from several cancer datasets (colon cancer, lung cancer, etc.).
The approach can also be extended to incorporate prior knowledge from a training dataset for the task of text categorization. Prior category labels from the training set can be used to influence the similarity measures between features (words), so that incoming test documents are better classified among the different categories. Thus, the same framework can be used for both clustering and categorization tasks, depending on the amount of prior information available.
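The mutual dependence between the two similarity matrices described in the abstract can be illustrated with a minimal sketch. This is not the thesis's exact formulation: the function names `co_similarity` and `_normalise`, the iteration count `n_iter`, and the cosine-style normalisation are all assumptions chosen for illustration. Starting from identity matrices, each pass recomputes the row (document) similarity from the current column (word) similarity, and then the reverse.

```python
import numpy as np


def _normalise(S):
    """Scale S so that its diagonal becomes 1 (cosine-style; an assumed choice)."""
    d = np.sqrt(np.diag(S))
    d[d == 0] = 1.0  # leave all-zero rows/columns untouched
    return S / np.outer(d, d)


def co_similarity(A, n_iter=4):
    """Illustrative sketch of an iterative co-similarity computation.

    A      : (documents x words) non-negative occurrence matrix
    n_iter : number of alternating passes (hypothetical parameter)
    Returns (SR, SC): document-document and word-word similarity matrices.
    """
    A = np.asarray(A, dtype=float)
    n_rows, n_cols = A.shape
    SR = np.eye(n_rows)  # initially each document is similar only to itself
    SC = np.eye(n_cols)  # initially each word is similar only to itself
    for _ in range(n_iter):
        # Document similarity weighted by the current word similarity
        SR = _normalise(A @ SC @ A.T)
        # Word similarity weighted by the updated document similarity
        SC = _normalise(A.T @ SR @ A)
    return SR, SC


if __name__ == "__main__":
    # Toy corpus: documents 0 and 2 share no word; both overlap with document 1.
    A = np.array([[1, 1, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 1, 1]])
    SR, SC = co_similarity(A, n_iter=2)
    print(SR)
```

In the toy corpus, the first and third documents have no word in common, so a single pass leaves their similarity at zero. A second pass makes it positive, because the words they contain have become similar through the middle document; this is the higher-order co-occurrence effect the abstract refers to. The resulting matrices could then be passed to any similarity-based clustering method.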
Contributor : Syed Fawad Hussain
Submitted on : Monday, October 11, 2010 - 4:50:20 PM
Last modification on : Friday, November 20, 2020 - 2:54:16 PM
Long-term archiving on : Wednesday, January 12, 2011 - 2:59:06 AM


  • HAL Id : tel-00525366, version 1



Syed Fawad Hussain. A New Co-similarity Measure : Application to Text Mining and Bioinformatics. Computer Science [cs]. Institut National Polytechnique de Grenoble - INPG, 2010. English. ⟨tel-00525366⟩


