Intelligent Content Acquisition in Web Archiving

Abstract : Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content is contained in Web pages. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. Then we propose an efficient unsupervised Web crawling system, ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that utilizes the inner structure of the pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising on effectiveness. We finally propose OWET (Open Web Extraction Toolkit) as a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.

Cited literature [137 references]
Contributor : MUHAMMAD Faheem
Submitted on : Friday, July 17, 2015 - 9:53:07 AM
Last modification on : Friday, July 31, 2020 - 10:44:09 AM
Long-term archiving on : Sunday, October 18, 2015 - 10:59:05 AM


  • HAL Id : tel-01177622, version 1



Muhammad Faheem. Intelligent Content Acquisition in Web Archiving. Computer Science [cs]. TELECOM ParisTech, 2014. English. ⟨NNT : 2014-ENST-0084⟩. ⟨tel-01177622⟩


