
Indexation et interrogation de chemins de lecture en contexte pour la recherche d'information structurée sur le web [Indexing and querying reading paths in context for structured information retrieval on the Web]

Abstract : The growth of the Web raises new challenges for Information Retrieval (IR). Most current systems reuse traditional models that were developed for textual, atomic and independent documents and are therefore poorly suited to the Web. The structure of the Web is an essential part of how information is described. Some approaches exploit this structure for IR, but most of them treat the whole set of links as a "bag of links": they model the Web as a directed graph with HTML pages as nodes and hypertext links as edges, without taking the type of each link into account. The aim of our work is to take links into account at both indexing time and query time in a Structured Information Retrieval System (SIRS). The proposed IR model is based on a model of hyperdocuments in context that considers four facets of information description on the Web: the content, the hierarchical structure, the linear or non-linear reading paths, and the context. A hyperdocument is modelled by its content (as for structured documents), a set of reading paths and a context (the accessible information space and the referencing information space). A specific indexing process is proposed for each facet. The evaluation of our SmartWeb system shows the benefit of combining the accessible information with the content. We then show the benefit of indexing both "structured documents" and "reading paths", using several automatically constructed structured test collections. The model is also implemented in a full SIRS, which shows the feasibility of our overall approach on the real Web. In particular, link typing is one of the most important aspects of our model and also the main difficulty in implementing it: we show that it is possible to extract a hierarchical structure from the Web and to identify different granularities of information.
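The hyperdocument model described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation or the SmartWeb code. The link types, the class and function names, and the rule that the context is derived from reference links only are all assumptions made for illustration. The abstract does not enumerate link types or specify how the facets are computed. The sketch also follows only linear reading paths, although the model covers non-linear ones as well.

```python
from dataclasses import dataclass, field
from enum import Enum


class LinkType(Enum):
    # Hypothetical link types; the abstract does not list the actual ones.
    HIERARCHICAL = "hierarchical"  # composition (document / sub-document)
    SEQUENTIAL = "sequential"      # next step on a reading path
    REFERENCE = "reference"        # pointer to other information


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    type: LinkType


@dataclass
class Hyperdocument:
    url: str
    content: str
    # Each reading path is an ordered list of page URLs.
    reading_paths: list[list[str]] = field(default_factory=list)
    # Context: pages reachable from this one, and pages pointing to it.
    accessible: set[str] = field(default_factory=set)
    referencing: set[str] = field(default_factory=set)


def reading_path(start: str, links: list[Link]) -> list[str]:
    """Follow sequential links from `start`, stopping at a dead end or a cycle."""
    successor = {l.source: l.target for l in links if l.type is LinkType.SEQUENTIAL}
    path = [start]
    while path[-1] in successor and successor[path[-1]] not in path:
        path.append(successor[path[-1]])
    return path


def build_context(url: str, links: list[Link]) -> tuple[set[str], set[str]]:
    """Return (accessible, referencing) page sets.

    Assumption: only reference links contribute to the context.
    """
    refs = [l for l in links if l.type is LinkType.REFERENCE]
    accessible = {l.target for l in refs if l.source == url}
    referencing = {l.source for l in refs if l.target == url}
    return accessible, referencing


links = [
    Link("a", "b", LinkType.SEQUENTIAL),
    Link("b", "c", LinkType.SEQUENTIAL),
    Link("a", "x", LinkType.REFERENCE),
    Link("y", "a", LinkType.REFERENCE),
]
accessible, referencing = build_context("a", links)
doc = Hyperdocument(
    url="a",
    content="text of page a",
    reading_paths=[reading_path("a", links)],
    accessible=accessible,
    referencing=referencing,
)
```

In a "bag of links" model the four links above would be indistinguishable edges. Here the type of each link decides whether it extends a reading path or feeds the context.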
Document type :
Theses

Cited literature [28 references]

https://tel.archives-ouvertes.fr/tel-00004453
Contributor : Thèses Imag
Submitted on : Tuesday, February 3, 2004 - 10:17:47 AM
Last modification on : Friday, November 6, 2020 - 4:05:47 AM
Long-term archiving on : Friday, April 2, 2010 - 8:13:18 PM

Identifiers

  • HAL Id : tel-00004453, version 1

Collections

UJF | CNRS | IMAG | UGA

Citation

Mathias Géry. Indexation et interrogation de chemins de lecture en contexte pour la recherche d'information structurée sur le web. domain_stic.hype. Université Joseph-Fourier - Grenoble I, 2002. French. ⟨tel-00004453⟩

