
Appariement de contenus textuels dans le domaine de la presse en ligne : Développement et adaptation d'un système de recherche d'information

Abstract : The goal of this thesis, conducted within an industrial framework, is to pair textual media content. Specifically, the aim is to pair online news articles with relevant videos for which a textual description is available. The main issue is therefore one of textual analysis; no image or spoken language analysis was undertaken in the present study. The question that arises is how to compare these particular objects, the texts, and what criteria to use to estimate their degree of similarity. We consider that one of these criteria is the topical similarity of their content: two documents must deal with the same topic to form a relevant pair. This problem falls within the field of Information Retrieval (IR), which is the main approach called upon in this research. Furthermore, when dealing with news content, the time dimension is of prime importance. To address this aspect, the field of Topic Detection and Tracking (TDT) is also explored. The pairing system developed in this thesis comprises several complementary steps. In the first step, the system uses Natural Language Processing (NLP) methods to index both articles and videos, in order to go beyond the traditional bag-of-words representation of texts. In the second step, two scores are calculated for each article-video pair: the first reflects their topical similarity and is based on a vector space model; the second expresses their proximity in time and is based on an empirical function. Finally, a classification model learned from manually annotated document pairs is used to rank the results. Evaluating the system's performance raised further questions in this doctoral research. The constraints imposed both by the data and by the specific needs of the partner company led us to adapt the evaluation protocol traditionally used in IR, namely the Cranfield paradigm. We therefore propose an alternative evaluation method that takes all of these constraints into account.
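The two-score step described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the whitespace tokenizer and plain TF-IDF weighting stand in for the NLP-based indexing, the half-life decay in `time_proximity` is a hypothetical substitute for the empirical time function, and the fixed weights `w_topic` and `w_time` are a placeholder for the classification model learned from annotated pairs. All function names and parameter values are assumptions made for illustration.

```python
import math
from collections import Counter
from datetime import datetime


def tokenize(text):
    # Lowercased whitespace tokens; an assumption, the thesis uses richer NLP indexing.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that terms found in every document keep a small weight.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def time_proximity(t1, t2, half_life_days=7.0):
    # Hypothetical decay: 1.0 for identical dates, 0.5 after half_life_days.
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return 0.5 ** (gap_days / half_life_days)


def rank_videos(article, videos, w_topic=0.7, w_time=0.3):
    """article and videos are (text, datetime) tuples.

    Returns a list of (score, video_index), best first. The weighted sum is a
    placeholder for the learned classification model mentioned in the abstract.
    """
    texts = [article[0]] + [v[0] for v in videos]
    vecs = tfidf_vectors(texts)
    scored = []
    for i, (_, when) in enumerate(videos):
        topic = cosine(vecs[0], vecs[i + 1])
        prox = time_proximity(article[1], when)
        scored.append((w_topic * topic + w_time * prox, i))
    return sorted(scored, reverse=True)


# Usage with made-up data: one article and two candidate video descriptions.
article = ("election results announced in paris", datetime(2017, 5, 8))
videos = [
    ("cooking show pasta recipe", datetime(2017, 5, 8)),
    ("paris election results live coverage", datetime(2017, 5, 7)),
]
ranking = rank_videos(article, videos)
```

In this toy example the second video shares three terms with the article and was published one day earlier, so it ranks above the first video, which matches on date only.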

Cited literature : 152 references

https://tel.archives-ouvertes.fr/tel-01713076
Contributor : Adèle Désoyer
Submitted on : Tuesday, February 20, 2018 - 10:41:15 AM
Last modification on : Monday, October 19, 2020 - 11:13:02 AM
Long-term archiving on : Monday, May 7, 2018 - 2:41:19 PM

File

manuscrit_ADesoyer_vfinale.pdf
Files produced by the author(s)

Identifiers

  • HAL Id : tel-01713076, version 1

Citation

Adèle Désoyer. Appariement de contenus textuels dans le domaine de la presse en ligne : Développement et adaptation d'un système de recherche d'information. Linguistique. Université Paris Nanterre, 2017. Français. ⟨tel-01713076⟩


Metrics

Record views

153

File downloads

442