Theses

Neural models for information retrieval : towards asymmetry sensitive approaches based on attention models

Thiziri Belkacem 1
1 IRIT-IRIS - Recherche d’Information et Synthèse d’Information
IRIT - Institut de recherche en informatique de Toulouse
Abstract : This work is situated in the context of information retrieval (IR) using machine learning (ML) and deep learning (DL) techniques. It concerns different tasks requiring text matching, such as ad-hoc retrieval, question answering and paraphrase identification. The objective of this thesis is to propose new approaches, using DL methods, to construct semantic-based models for text matching, and to overcome the problems of vocabulary mismatch related to the classical bag-of-words (BoW) representations used in traditional IR models. Indeed, traditional text matching methods are based on the BoW representation, which considers a given text as a set of independent words. The process of matching two sequences of text is based on the exact matching between words. The main limitation of this approach is related to the vocabulary mismatch. This problem occurs when the text sequences to be matched do not use the same vocabulary, even if their subjects are related. For example, the query may contain several words that are not necessarily used in the documents of the collection, including relevant documents. BoW representations ignore several aspects of a text sequence, such as its structure and the context of words. These characteristics are important and make it possible to differentiate between two texts that use the same words but express different information. Another problem in text matching is related to the length of documents. The relevant parts can be distributed in different ways in the documents of a collection. This is especially true in long documents, which tend to cover a large number of topics and use a varied vocabulary. A long document could thus contain several relevant passages that a matching model must capture. Unlike long documents, short documents are likely to be relevant to a specific subject and tend to contain a more restricted vocabulary. Assessing their relevance is in principle simpler than assessing that of longer documents.
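The vocabulary mismatch described above can be illustrated with a minimal sketch. It is not taken from the thesis: the function names (bow, exact_match_score) and the example texts are hypothetical, and the score is a plain count of shared words rather than any particular IR ranking function.

```python
from collections import Counter


def bow(text):
    """Bag of words: lowercase whitespace tokens with counts, word order discarded."""
    return Counter(text.lower().split())


def exact_match_score(query, document):
    """Count the query word occurrences that appear verbatim in the document."""
    q, d = bow(query), bow(document)
    return sum(min(count, d[term]) for term, count in q.items())


# Hypothetical texts, invented for illustration.
query = "cheap car rental"
relevant_doc = "affordable automobile hire in toulouse"
off_topic_doc = "cheap flights and cheap hotels"

# The related document shares no word with the query, so exact matching misses it,
# while the unrelated document is rewarded for sharing the word "cheap".
score_relevant = exact_match_score(query, relevant_doc)
score_off_topic = exact_match_score(query, off_topic_doc)

# Word order is lost: both sentences map to the same bag of words.
same_bag = bow("dog bites man") == bow("man bites dog")
```

Under these assumptions the related document scores lower than the unrelated one, and two sentences with opposite meanings become indistinguishable, which are the two BoW limitations the abstract points out.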
In this thesis, we have proposed different contributions, each addressing one of the above-mentioned issues. First, in order to solve the problem of vocabulary mismatch, we used distributed representations of words (word embeddings) to allow semantic matching between different words. These representations have been used in IR applications where document/query similarity is computed by comparing all the term vectors of the query with all the term vectors of the document, indiscriminately. Unlike the state-of-the-art models, we studied the impact of query terms according to their presence or absence in a document, and we adopted different document/query matching strategies accordingly. The intuition is that the absence of query terms from relevant documents is in itself a useful signal to take into account in the matching process. Indeed, such terms are missing from a document for one of two possible reasons: either their synonyms have been used, or they are not part of the context of the document. The methods we have proposed make it possible, on the one hand, to perform an inexact matching between the document and the query, and on the other hand, to evaluate the impact of the different terms of a query in the matching process. Although the use of word embeddings allows semantic-based matching between different text sequences, these representations combined with classical matching models still treat the text as a list of independent elements (a bag of vectors instead of a bag of words). However, the structure of the text as well as the order of the words is important. Any change in the structure of the text and/or the order of words alters the information expressed. In order to solve this problem, neural models were used in text matching.
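The idea of treating present and absent query terms differently can be sketched as follows. This is only an illustration under stated assumptions, not the matching strategy defined in the thesis: the two-dimensional vectors in EMB are invented toy values, match_terms is a hypothetical helper, and the rule used here (score 1.0 for a term found verbatim, best cosine similarity otherwise) is one simple possibility among many.

```python
import math

# Toy 2-d word embeddings, invented for illustration only.
EMB = {
    "car": (1.0, 0.0),
    "automobile": (0.9, 0.1),
    "rental": (0.0, 1.0),
    "hire": (0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_terms(query_terms, doc_terms):
    """Map each query term to (present_in_document, score).

    Present terms are matched exactly (score 1.0). Absent terms fall back to
    the most similar document term in embedding space (assumed rule).
    """
    doc_set = set(doc_terms)
    result = {}
    for term in query_terms:
        if term in doc_set:
            result[term] = (True, 1.0)
        else:
            sims = [
                cosine(EMB[term], EMB[d])
                for d in doc_set
                if term in EMB and d in EMB
            ]
            result[term] = (False, max(sims, default=0.0))
    return result


# Neither query term occurs in the first document, but each has a close synonym.
synonym_doc = match_terms(["car", "rental"], ["automobile", "hire"])
# In the second document "car" occurs verbatim and "rental" does not.
mixed_doc = match_terms(["car", "rental"], ["car", "hire"])
```

Keeping the presence flag next to the score is what would let a downstream model weight the two cases differently, for example by trusting exact occurrences more than embedding-based approximations. Note that this sketch still ignores word order, which is the remaining limitation the abstract attributes to bag-of-vectors approaches.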
Document type : Theses

Cited literature [275 references]

https://tel.archives-ouvertes.fr/tel-02499432
Contributor : Abes Star
Submitted on : Thursday, March 5, 2020 - 11:36:31 AM
Last modification on : Thursday, June 10, 2021 - 3:08:01 AM
Long-term archiving on : Saturday, June 6, 2020 - 2:27:41 PM

File

2019TOU30167b.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02499432, version 1

Citation

Thiziri Belkacem. Neural models for information retrieval : towards asymmetry sensitive approaches based on attention models. Information Retrieval [cs.IR]. Université Paul Sabatier - Toulouse III, 2019. English. ⟨NNT : 2019TOU30167⟩. ⟨tel-02499432⟩
