
Understanding the Hidden Web

Pierre Senellart 1
1 GEMO - Integration of data and knowledge distributed over the web
LRI - Laboratoire de Recherche en Informatique, UP11 - Université Paris-Sud - Paris 11, Inria Saclay - Ile de France, CNRS - Centre National de la Recherche Scientifique : UMR8623
Abstract : The hidden Web (also known as the deep or invisible Web), that is, the part of the Web that is not directly accessible through hyperlinks but only through HTML forms or Web services, is of great value but difficult to exploit. We discuss a process for the fully automatic discovery, syntactic and semantic analysis, and querying of hidden-Web services. We first propose a general architecture that relies on a semi-structured warehouse of imprecise (probabilistic) content. We provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how a combination of heuristics and probing can be used to understand the structure of an HTML form. We present an original use of a supervised machine-learning method, namely conditional random fields, in an unsupervised manner, applied to an automatic, imperfect, and imprecise annotation based on domain knowledge, in order to extract relevant information from HTML result pages. To obtain semantic relations between the inputs and outputs of a hidden-Web service, we investigate the complexity of deriving a schema mapping between database instances, relying solely on the presence of constants in the two instances. Finally, we describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to process a user's high-level query using such descriptions.
Document type : Theses

Cited literature [9 references]

https://tel.archives-ouvertes.fr/tel-00198150
Contributor : Pierre Senellart
Submitted on : Monday, December 17, 2007 - 12:15:37 AM
Last modification on : Wednesday, October 14, 2020 - 4:00:29 AM
Long-term archiving on : Thursday, September 27, 2012 - 11:30:18 AM

Identifiers

  • HAL Id : tel-00198150, version 1

Citation

Pierre Senellart. Understanding the Hidden Web. Human-Computer Interaction [cs.HC]. Université Paris Sud - Paris XI, 2007. English. ⟨tel-00198150⟩

Metrics

Record views : 777
File downloads : 1084