
Collecte orientée sur le Web pour la recherche d’information spécialisée

Abstract : Vertical search engines, which focus on a specific segment of the Web, are becoming increasingly present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs.

In this thesis, we tackle the first inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies to gather topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling).

The first part of our research is dedicated to focused search. In this context, a standard approach consists of combining domain-specific terms into queries, submitting those queries to a search engine, and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics, we propose to enhance it in two ways. Upstream of the search engine, we aim to formulate more relevant queries in order to increase the precision of the top retrieved documents. To do so, we define a metric, based on a co-occurrence graph and a random walk algorithm, that aims to predict the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the quality of the document collection. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm, so as to simultaneously rank by relevance the documents and the terms appearing in our corpus.

In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally. Then, we consider the problem of crawl frontier ordering, which is at the very heart of a focused crawler. Such an ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the number of non-relevant ones. We propose to apply learning-to-rank algorithms to efficiently order the crawl frontier, and define a method to learn a ranking function from existing crawls.
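As an illustration of the random walk with restart mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. The abstract does not state which node types form the tripartite graph, so the layout used here (queries, documents, terms), the node names, the restart probability of 0.15, and the function name `random_walk_with_restart` are all assumptions made for the example.

```python
def random_walk_with_restart(edges, seeds, restart=0.15, iterations=100):
    """Score every node of an undirected graph by power iteration.

    edges: iterable of (u, v) pairs; seeds: set of nodes the walker jumps
    back to with probability `restart` at each step.
    """
    # Build adjacency sets from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    missing = set(seeds) - set(neighbors)
    if not seeds or missing:
        raise ValueError("seeds must be a non-empty set of nodes in the graph")

    nodes = sorted(neighbors)
    # Restart distribution: uniform over the seed nodes.
    restart_dist = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart_dist)

    for _ in range(iterations):
        # Restart mass first, then mass propagated along the edges.
        new_scores = {n: restart * restart_dist[n] for n in nodes}
        for n in nodes:
            share = (1.0 - restart) * scores[n] / len(neighbors[n])
            for m in neighbors[n]:
                new_scores[m] += share
        scores = new_scores
    return scores


# Hypothetical toy graph: one query retrieves three documents; two of them
# contain the seed term, the third contains an unrelated term.
edges = [
    ("q:solar panel", "d1"),
    ("q:solar panel", "d2"),
    ("q:solar panel", "d3"),
    ("d1", "t:photovoltaic"),
    ("d2", "t:photovoltaic"),
    ("d2", "t:inverter"),
    ("d3", "t:conference"),
]
scores = random_walk_with_restart(edges, seeds={"t:photovoltaic"})

# Documents and terms are ranked from the same score vector.
ranked_docs = sorted((n for n in scores if n.startswith("d")),
                     key=scores.get, reverse=True)
ranked_terms = sorted((n for n in scores if n.startswith("t:")),
                      key=scores.get, reverse=True)
```

Because one score vector covers all nodes, sorting the document nodes and the term nodes separately yields both rankings from a single run, which is the "simultaneous" ordering the abstract refers to. In this toy graph, the document linked only to the unrelated term ends up below the two documents that share the seed term.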
Document type :
Theses

Cited literature : 146 references

https://tel.archives-ouvertes.fr/tel-00853250
Contributor : Abes Star
Submitted on : Thursday, August 22, 2013 - 11:32:10 AM
Last modification on : Monday, December 14, 2020 - 9:52:06 AM
Long-term archiving on : Thursday, April 6, 2017 - 5:06:25 AM

File

VD2_DEGROC_CLEMENT_05062013.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00853250, version 1

Citation

Clément de Groc. Collecte orientée sur le Web pour la recherche d’information spécialisée. Other [cs.OH]. Université Paris Sud - Paris XI, 2013. French. ⟨NNT : 2013PA112073⟩. ⟨tel-00853250⟩

Metrics

Record views : 1240

File downloads : 3695