Modélisation d'un système de recherche d'information pour les systèmes hypertextes. Application à la recherche d'information sur le World Wide Web

Fernando Jorge Carvalho de Aguiar

Abstract

In a hypertext, documents are often composed of a set of nodes rather than a single one. The information a page conveys might not be fully grasped if only its own content is considered. The content of the other pages that, together with it, make up a document carries contextual information. Taking this contextual information into account when indexing pages is fundamental to the quality of their indexes. Information retrieval systems for the Web, commonly known as Web search engines, should take into account that Web documents are split into several pages: a page should not be treated as a fully-fledged document, since it is only a part of one. Therefore, when indexing a page one should consider its contextual information, which is often located in its neighborhood. Traditionally, Web search engines treat pages as fully-fledged documents and build their indexes from page content alone. Contextual information is not considered. In this work we put forward a new information retrieval model for search engines running over Web sites. Its cornerstone is a two-level index for the pages composing the site: the bottom level is built solely from the content of the page itself, and the top level is built from an analysis of the content of the pages that give context to the page being indexed. We aim to improve the effectiveness of the search engine by improving the quality of the pages' indexes. Implementing a search engine prototype that integrates the proposed model, and using the WT10g test collection from the TREC conferences, adapted to our needs, allowed us to carry out a large number of tests. The results showed that the prototype is more effective than a search engine built on a traditional model in which contextual information is not used to index pages. The tests thus provide evidence that contextual information is worth considering when modelling a search engine.

In a hypertext, a document is often composed of several nodes rather than a single one. The information conveyed by a given node can hardly be grasped by reading the content of that node alone: the content of the other nodes that make up a document together with it gives it a context. Knowing this context is fundamental to understanding the information conveyed by the node. An information retrieval system, more commonly called a search engine, applied to the hypertext system that is the Web should take into account the fragmentation of hypertext documents into several pages: a page is not a fully-fledged document, it is only a part of one. Thus, to index a page well, the context of the information it conveys must be considered. Search engines often treat a page as a document and index it by analyzing its content alone. The context of pages is ignored. In this work we propose an information retrieval model for a search engine applied to a hypertext system consisting of a Web site. This model relies on building a two-level index for each page of the site: a first, lower level built from the content of the page alone, and a second, upper level built from the content of the pages that give a context to the content of the page being indexed. By improving the quality of the page indexes, we seek to improve the effectiveness of the search engine. By implementing a search engine prototype that integrates the proposed model and by using the WT10g test collection from the TREC conferences, adapted to our needs, we were able to carry out experiments. The results of these experiments, an improvement in the quality of the answers returned by the prototype engine, are favorable indicators of the usefulness of the pages' contextual information. The effectiveness of the prototype engine was compared with that of a search engine adopting a traditional model in which a single index level, the one built from page content alone, is used.
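
To make the two-level index idea concrete, here is a minimal Python sketch. It is an illustration only, not the model developed in the thesis: the abstract does not say how context pages are selected or how the two levels are combined at query time, so the `context_of` mapping, the raw term-frequency indexes, and the `context_weight` parameter are all assumptions introduced for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_two_level_index(pages, context_of):
    """Return {page_id: (bottom, top)} term-frequency indexes.

    pages:      {page_id: text}
    context_of: {page_id: [ids of pages assumed to give it context]}
                (hypothetical input; how context pages are chosen is
                not specified in the abstract)
    """
    # Bottom level: built solely from the content of the page itself.
    bottom = {pid: Counter(tokenize(text)) for pid, text in pages.items()}
    index = {}
    for pid in pages:
        # Top level: built from the content of the page's context pages.
        top = Counter()
        for ctx_id in context_of.get(pid, []):
            top.update(bottom[ctx_id])
        index[pid] = (bottom[pid], top)
    return index


def score(index, page_id, query, context_weight=0.5):
    """Toy score: own-content matches plus down-weighted context matches.

    The linear combination and context_weight are assumptions made for
    illustration, not a formula taken from the thesis.
    """
    bottom, top = index[page_id]
    return sum(bottom[t] + context_weight * top[t] for t in tokenize(query))


# Hypothetical three-page site in which "home" gives context to the others.
pages = {
    "home": "Jazz festival in Lyon",
    "program": "Friday evening program: three concerts",
    "contact": "Contact the organisers",
}
context_of = {"program": ["home"], "contact": ["home"]}
index = build_two_level_index(pages, context_of)

with_context = score(index, "program", "jazz concerts")
without_context = score(index, "program", "jazz concerts", context_weight=0.0)
```

In this toy site the "program" page never mentions jazz, so a content-only index (`context_weight=0.0`) matches only the term "concerts", while the top level lets the page also benefit from the term "jazz" found on the page assumed to give it context.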

A new model for hypertext information retrieval system : Application to world wide web information retrieval
