Generalized Cosine and Similarity Metrics: A Supervised Learning Approach based on Nearest Neighbors

Abstract : Almost all machine learning problems depend heavily on the metric used. Many works have shown that learning the metric structure from the data is a far better approach than assuming a simple geometry based on the identity matrix. This has paved the way for a new research theme called metric learning. Most works in this domain have based their approaches on distance learning only. However, other works have shown that similarity should be preferred over distance metrics when dealing with textual datasets as well as with non-textual ones. Being able to efficiently learn appropriate similarity measures, as opposed to distances, is thus of high importance for various collections. Although several works have partially addressed this problem for different applications, no previous work is known to have fully addressed it in the context of learning similarity metrics for kNN classification. This is exactly the focus of the current study. In information filtering systems, where the aim is to filter an incoming stream of documents into a set of predefined topics with little supervision, cosine-based category-specific thresholds can be learned. Learning such thresholds can be seen as a first step towards learning a complete similarity measure. This strategy was used to develop Online and Batch algorithms for information filtering during the INFILE (Information Filtering) track of the CLEF (Cross Language Evaluation Forum) campaign in 2008 and 2009. However, provided enough supervised information is available, as is the case in classification settings, it is usually beneficial to learn a complete metric as opposed to learning thresholds. To this end, we developed numerous algorithms for learning complete similarity metrics for kNN classification. An unconstrained similarity learning algorithm called SiLA is developed, in which the normalization is independent of the similarity matrix.
SiLA encompasses, among others, the standard cosine measure, as well as the Dice and Jaccard coefficients. SiLA is an extension of the voted perceptron algorithm and allows learning different types of similarity functions (based on diagonal, symmetric or asymmetric matrices). We then compare SiLA with RELIEF, a well-known feature re-weighting algorithm. It has recently been suggested by Sun and Wu that RELIEF can be seen as a distance metric learning algorithm optimizing a cost function which is an approximation of the 0-1 loss. We show here that this approximation is loose, and propose a stricter version closer to the 0-1 loss, leading to a new, and better, RELIEF-based algorithm for classification. We then focus on a direct extension of the cosine similarity measure, defined as a normalized scalar product in a projected space. The associated algorithm is called generalized Cosine simiLarity Algorithm (gCosLA). All of the algorithms are tested on many different datasets. A statistical test, the s-test, is employed to assess whether the results are significantly different. gCosLA performed statistically much better than SiLA on many of the datasets. Furthermore, SiLA and gCosLA were compared with many state-of-the-art algorithms, illustrating their well-foundedness.
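
Illustration (not part of the thesis text): the abstract describes the generalized cosine as a normalized scalar product in a projected space and uses it inside a kNN classifier. The sketch below shows one plausible reading of that description, assuming the form s_A(x, y) = x^T A y / (sqrt(x^T A x) * sqrt(y^T A y)) with A positive semi-definite. The function names, the toy data and the fixed matrix A are all hypothetical; in the thesis A is learned from data, which is not shown here. With A set to the identity matrix the function reduces to the standard cosine.

```python
import math
from collections import Counter


def bilinear(x, A, y):
    """Return x^T A y for list vectors x, y and a square matrix A (list of rows)."""
    return sum(x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))


def generalized_cosine(x, y, A):
    """Normalized scalar product under A (assumed positive semi-definite)."""
    denom = math.sqrt(bilinear(x, A, x)) * math.sqrt(bilinear(y, A, y))
    # Return 0.0 when either vector has zero norm under A, to avoid dividing by zero.
    return bilinear(x, A, y) / denom if denom > 0 else 0.0


def knn_predict(query, examples, labels, A, k=3):
    """Majority vote among the k training examples most similar to the query."""
    ranked = sorted(
        range(len(examples)),
        key=lambda i: generalized_cosine(query, examples[i], A),
        reverse=True,  # higher similarity means a closer neighbour
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data; A is fixed by hand here rather than learned.
identity = [[1.0, 0.0], [0.0, 1.0]]
train = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["a", "a", "b", "b"]
prediction = knn_predict([1.0, 0.3], train, train_labels, identity, k=3)
```

Note that in this form the normalization depends on A, which is the property the abstract attributes to gCosLA; for SiLA the abstract states that the normalization is independent of the similarity matrix, so the denominator would be computed differently.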
Document type :
Theses
Computer Science [cs]. Université de Grenoble, 2010. English


https://tel.archives-ouvertes.fr/tel-00591988
Contributor : Ali Mustafa Qamar
Submitted on : Monday, June 6, 2011 - 12:56:00 PM
Last modification on : Monday, June 6, 2011 - 1:16:34 PM

Identifiers

  • HAL Id : tel-00591988, version 3

Citation

Ali Mustafa Qamar. Generalized Cosine and Similarity Metrics: A Supervised Learning Approach based on Nearest Neighbors. Computer Science [cs]. Université de Grenoble, 2010. English. <tel-00591988v3>
