Revisiting the Coupling of Natural Language Processing and Information Retrieval (Revisiter le couplage traitement automatique des langues et recherche d'information)

Fabienne Moreau 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Information retrieval systems (IRSs) aim to establish a relationship between users' information needs and the information contained in documents. A commonly used method for doing so is a simple match between query terms and document words. This mechanism exposes IRSs to two problems. The first is polysemy: a single term may have different meanings and represent various concepts. The second, dual problem is that a single idea may be expressed in different forms. To overcome these limitations, a natural solution is to perform a linguistic analysis of both documents and queries using natural language processing (NLP) techniques. This makes it possible to treat each word as a linguistic entity rather than as a simple string of characters, and thus to provide a more relevant document-query match. However, previous studies that tried to enrich IRSs with linguistic information have often produced disappointing, unclear, and contradictory results. In order to better understand and improve upon these weak results, we propose a new approach to coupling NLP and IR. In contrast with other studies, we choose to fully exploit the richness of language by combining several levels of linguistic information: morphological, syntactic, and semantic. To test this combination of knowledge types, we have designed a test platform that integrates them in parallel within the same IRS. The platform demonstrates the clear and significant contribution of several types of information (especially morphological and semantic) and, through an original analysis of the correlations between the various linguistic indexes, highlights some interesting cases of complementarity.
Using a supervised machine-learning technique that merges the lists of documents produced with each linguistic index and automatically adapts its behavior to the query's characteristics, we show that combining multilevel linguistic information provides better overall results that are also far more stable than those obtained in comparable tests. Finally, we propose a new method for acquiring morphological variants based on unsupervised learning techniques, which further increases the impact of this effective knowledge on the performance of our IRS. We show that by introducing more flexible tools that are better adapted to the constraints of IR, NLP can make a real contribution to this area.
Document type :
Theses

Cited literature [208 references]

https://tel.archives-ouvertes.fr/tel-00524514
Contributor : Patrick Gros
Submitted on : Friday, October 8, 2010 - 9:42:53 AM
Last modification on : Friday, November 16, 2018 - 1:21:49 AM
Long-term archiving on : Monday, January 10, 2011 - 11:39:19 AM

Identifiers

  • HAL Id : tel-00524514, version 1

Citation

Fabienne Moreau. Revisiter le couplage traitement automatique des langues et recherche d'information. Human-Computer Interaction [cs.HC]. Université Rennes 1, 2006. French. ⟨tel-00524514⟩

Metrics

  • Record views : 373
  • File downloads : 850