
Deriving Semantic Objects from the Structured Web

Marilena Oita 1
1 DBWeb
LTCI - Laboratoire Traitement et Communication de l'Information
Abstract : This thesis focuses on the extraction and analysis of Web data objects, investigated from three points of view: temporal, structural, and semantic. We first survey strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study of Web feeds for this particular purpose. Next, in the context of Web pages dynamically generated by content management systems, we present two keyword-based techniques that perform article extraction from such pages. Keywords, either automatically acquired through a Tf-Idf analysis or extracted from Web feeds, guide the process of object identification, either at the level of a single Web page (SIGFEED algorithm) or across different pages sharing the same template (FOREST algorithm). Finally, in the context of the deep Web, we present a generic framework that aims at discovering the semantic model of a Web object (here, a data record). It first uses FOREST to extract objects, then represents the implicit rdf:type similarities between the object attributes and the entity of the Web interface as relationships that, together with the instances extracted from the objects, form a labeled graph. This graph is then aligned to a generic ontology such as YAGO to discover the graph's unknown types and relations.
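As a rough illustration of the keyword-guided idea mentioned in the abstract, the sketch below ranks the terms of one page by Tf-Idf against other pages and then picks the text block that contains the most keyword occurrences. It is not the SIGFEED or FOREST algorithm from the thesis: the function names (`tokenize`, `tfidf_keywords`, `best_block`), the smoothed Idf formula, the flat list of text blocks standing in for DOM nodes, and the sample pages are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(target, corpus, top_k=3):
    """Return the top_k terms of `target`, ranked by Tf-Idf against `corpus`.

    Uses a smoothed Idf, log((1 + N) / (1 + df)) + 1, which is one common
    variant and an assumption of this sketch.
    """
    docs = [set(tokenize(doc)) for doc in corpus]
    tf = Counter(tokenize(target))
    total = sum(tf.values())
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / total) * idf
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def best_block(blocks, keywords):
    """Pick the text block with the most keyword occurrences."""
    def score(block):
        counts = Counter(tokenize(block))
        return sum(counts[k] for k in keywords)
    return max(blocks, key=score)


# Hypothetical pages sharing the same navigation text ("home news ...").
pages = [
    "home news contact login volcano erupts in iceland volcano ash closes airports",
    "home news contact login election results announced tonight",
    "home news contact login football final ends in penalty shootout",
]
# Hypothetical segmentation of the first page into two text blocks.
blocks = [
    "home news contact login",
    "volcano erupts in iceland volcano ash closes airports",
]
keywords = tfidf_keywords(pages[0], pages, top_k=1)
article = best_block(blocks, keywords)
```

Terms that appear on every page, such as the shared navigation words, receive a low Idf, so the keywords tend to come from the page-specific article text rather than from the template.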

Cited literature : 134 references
Contributor : Marilena Oita
Submitted on : Thursday, December 26, 2013 - 10:13:24 PM
Last modification on : Friday, July 31, 2020 - 10:44:09 AM
Long-term archiving on : Friday, March 28, 2014 - 5:03:05 PM


  • HAL Id : tel-00922459, version 1



Marilena Oita. Deriving Semantic Objects from the Structured Web. Web. Telecom ParisTech, 2012. English. ⟨tel-00922459⟩


