Crawl intelligent et adaptatif d'applications web pour l'archivage du web, Ingénierie des systèmes d'information, vol.19, issue.4, pp.61-86, 2014. ,
DOI : 10.3166/isi.19.4.61-86
URL : https://hal.archives-ouvertes.fr/hal-01069818
Demonstrating intelligent crawling and archiving of web applications, Proceedings of the 22nd ACM international conference on information & knowledge management, CIKM '13, 2013. ,
DOI : 10.1145/2505515.2508197
URL : https://hal.archives-ouvertes.fr/hal-00952006
Intelligent and Adaptive Crawling of Web Applications for Web Archiving, Proc. ICWE, 2013. ,
DOI : 10.1007/978-3-642-39200-9_26
URL : https://hal.archives-ouvertes.fr/hal-00874444
Intelligent crawling of Web applications for Web archiving, Proc. PhD Symposium of WWW, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00874444
Une démonstration d'un crawler intelligent pour les applications Web, Proc. BDA Demonstration. Conference without formal proceedings, 2013. ,
Collecte intelligente et adaptative d'applications Web pour l'archivage du Web, Proc. BDA. Conference without formal proceedings, 2013. ,
URL : https://hal.archives-ouvertes.fr/hal-00952133
OWET: A Comprehensive Toolkit for Wrapper Induction and Scalable Data Extraction ,
Adaptive Crawling Driven by Structure-Based Link Classification ,
DOI : 10.1007/978-3-319-27974-9_5
Adaptive geospatially focused crawling, Proceeding of the 18th ACM conference on Information and knowledge management, CIKM '09, 2009. ,
DOI : 10.1145/1645953.1646011
The connectivity sonar, Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, HYPERTEXT '03, 2003. ,
DOI : 10.1145/900051.900060
A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations, Data & Knowledge Engineering, vol.66, issue.2, pp.326-337, 2008. ,
DOI : 10.1016/j.datak.2008.04.003
Extracting structured data from Web pages, Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, 2003. ,
DOI : 10.1145/872757.872799
We knew the web was big, 2008. ,
Combining text and link analysis for focused crawling - An application for vertical search engines, Information Systems, vol.32, issue.6, pp.886-908, 2007. ,
DOI : 10.1016/j.is.2006.09.004
WebOQL: Restructuring documents, databases and Webs, Proceedings of the Fourteenth International Conference on Data Engineering, 1998. ,
The Kulturarw3 Project - The Royal Swedish Web Archiw3e - An example of "complete" collection of web pages, Proceedings of the 66th IFLA Council and General Conference, 2000. ,
Crawling programs for wrapper-based applications, 2008 IEEE International Conference on Information Reuse and Integration, 2008. ,
DOI : 10.1109/IRI.2008.4583023
UbiCrawler: a scalable fully distributed Web crawler, Software: Practice and Experience, vol.34, issue.8, pp.711-726, 2004. ,
DOI : 10.1002/spe.587
Highly efficient algorithms for structural clustering of large websites, Proceedings of the 20th international conference on World wide web, WWW '11, 2011. ,
DOI : 10.1145/1963405.1963468
The deep web: Surfacing hidden value, 2000. ,
Siphoning hidden-web data through keyword-based interfaces, Proceedings of the 19th Brazilian Symposium on Databases, 2004. ,
Searching for hidden-web databases, WebDB, 2005. ,
An adaptive crawler for locating hidden-Web entry points, WWW, 2007. ,
A training algorithm for optimal margin classifiers, Proceedings of the fifth annual workshop on Computational learning theory, COLT '92, 1992. ,
DOI : 10.1145/130385.130401
The anatomy of a large-scale hypertextual Web search engine, WWW, 1998. ,
DOI : 10.1016/S0169-7552(98)00110-X
The first web page, amazingly, is lost, 2013. ,
UKWAC, D-Lib Magazine, vol.12, issue.1, 2006. ,
DOI : 10.1045/january2006-thompson
URL : http://doi.org/10.1045/january2006-thompson
Crawling Towards Eternity: Building an Archive of the World Wide Web, Web Techniques Magazine, 1997. ,
Crawling a country, Special interest tracks and posters of the 14th international conference on World Wide Web, WWW '05, 2005. ,
DOI : 10.1145/1062745.1062768
Do not crawl in the dust: Different urls with similar text, WWW, 2007. ,
The evolution of the web and implications for an incremental crawler, VLDB, 2000. ,
Automatic repairing of web wrappers, Proceeding of the third international workshop on Web information and data management, WIDM '01, 2001. ,
DOI : 10.1145/502932.502938
A Survey of Web Information Extraction Systems, IEEE Transactions on Knowledge and Data Engineering, vol.18, issue.10, pp.1411-1428, 2006. ,
DOI : 10.1109/TKDE.2006.152
A taxonomy of JavaScript redirection spam, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, AIRWeb '07, 2007. ,
DOI : 10.1145/1244408.1244423
Text Processing with GATE (Version 6) ,
[CMM01] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites, VLDB, 2001. ,
Roadrunner: automatic data extraction from data-intensive web sites, SIGMOD, 2002. ,
Fine-grain Web site structure discovery, WIDM, 2003. ,
Clustering Web pages based on their structure, Data & Knowledge Engineering, vol.54, issue.3, pp.279-299, 2005. ,
DOI : 10.1016/j.datak.2004.11.004
Blogs and the new politics of listening, The Political Quarterly, pp.272-280, 2008. ,
Focused crawling: a new approach to topic-specific Web resource discovery, Computer Networks, vol.31, issue.11-16, p.1623, 1999. ,
DOI : 10.1016/S1389-1286(99)00052-3
Archiving the web: The PANDORA archive at the National Library of Australia, National Library of Australia Staff Papers, 2009. ,
iRobot, Proceeding of the 17th international conference on World Wide Web, WWW '08, 2008. ,
DOI : 10.1145/1367497.1367558
Path sharing and predicate evaluation for high-performance XML filtering, ACM Transactions on Database Systems, vol.28, issue.4, pp.467-516, 2003. ,
DOI : 10.1145/958942.958947
Information retrieval in the World-Wide Web: Making client-based searching feasible, WWW, 1994. ,
DOI : 10.1016/0169-7552(94)90132-5
Focused crawling using context graphs, VLDB, 2000. ,
[dK13] Maurice de Kunder. The indexed Web ,
SHARC, Proc. VLDB Endow, pp.586-597, 2009. ,
DOI : 10.14778/1687627.1687694
URL : https://hal.archives-ouvertes.fr/hal-01122670
Ontology-focused crawling of Web documents, Proceedings of the 2003 ACM symposium on Applied computing, SAC '03, 2003. ,
DOI : 10.1145/952532.952761
Cascading style sheets (CSS) snapshot 2007, 2008. ,
Automatic Wrapper Adaptation by Tree Edit Distance Matching, Combinations of Intelligent Methods and Applications, 2010. ,
DOI : 10.1007/978-3-642-19618-8_3
OXPath: A language for scalable, memory-efficient data extraction from Web applications, p.4, 2011. ,
DIADEM, Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, 2012. ,
DOI : 10.1145/2187980.2188025
A large-scale study of the evolution of web pages, WWW, 2003. ,
Information extraction from HTML: Application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998. ,
A grammar inference algorithm for the world wide web, AAAI, 1996. ,
Internet encyclopaedias go head to head, Nature, vol.438, issue.7070, 2005. ,
DOI : 10.1038/438900a
Geographically focused collaborative crawling, Proceedings of the 15th international conference on World Wide Web, WWW '06, 2006. ,
DOI : 10.1145/1135777.1135822
Board Forum Crawling: A Web Crawling Method for Web Forum, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06), 2006. ,
DOI : 10.1109/WI.2006.52
In Search of the Lost Schema, ICDT, 1999. ,
DOI : 10.1007/3-540-49257-7_20
A Survey on Web Archiving Initiatives, Proceedings of the 15th International Conference on Theory and Practice of Digital Libraries: Research and Advanced Technology for Digital Libraries, 2011. ,
DOI : 10.1145/602421.602422
Web-scale information extraction with vertex, ICDE, 2011. ,
Scalable, generic, and adaptive systems for focused crawling, Proceedings of the 25th ACM conference on Hypertext and social media, HT '14, 2014. ,
DOI : 10.1145/2631775.2631795
URL : https://hal.archives-ouvertes.fr/hal-01069821
The volume and evolution of Web page templates, WWW, 2005. ,
[GS05] The indexable web is more than 11.5 billion pages, WWW, 2005. ,
API Blender: A uniform interface to social platform APIs, WWW, 2012. ,
URL : https://hal.archives-ouvertes.fr/hal-00690621
XPath Formal Semantics and Beyond: a Coq based approach, TPHOLs, 2004. ,
The shark-search algorithm. An application: tailored Web site mapping, Computer Networks and ISDN Systems, vol.30, issue.1-7, p.317, 1998. ,
DOI : 10.1016/S0169-7552(98)00038-5
Mercator: A scalable, extensible web crawler, World Wide Web, vol.2, issue.4, pp.219-229, 1999. ,
DOI : 10.1023/A:1019213109274
THESUS: Organizing Web document collections based on link semantics, The VLDB Journal The International Journal on Very Large Data Bases, vol.12, issue.4, pp.320-332, 2003. ,
DOI : 10.1007/s00778-003-0100-6
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.15.349
IICA: An Ontology-based Internet Navigation System, AAAI, 1996. ,
Information and documentation - WARC file format, 2009. ,
FoCUS, Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, 2013. ,
DOI : 10.1145/2187980.2187985
URL : https://hal.archives-ouvertes.fr/hal-01305623
Evolving strategies for focused web crawling, Proceedings of the 20th International Conference on Machine Learning, 2003. ,
Obama's victory tweet 'four more years' makes history, The Independent, 2012. ,
SVMs for the blogosphere: Blog identification and splog detection, AAAI, 2006. ,
Mining Web informative structures and contents based on entropy analysis, IEEE Trans. Knowl. Data Eng, 2004. ,
A longitudinal study of web pages continued: a consideration of document persistence, Inf. Res, vol.9, issue.2, 2003. ,
Visual OXPath: Robust Wrapping by Example, 2012. ,
Visual OXPath, Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion ,
DOI : 10.1145/2187980.2188051
Regression testing for wrapper maintenance, AAAI, 1999. ,
Wrapper induction: Efficiency and expressiveness, Artificial Intelligence, vol.118, issue.1-2, pp.15-68, 2000. ,
DOI : 10.1016/S0004-3702(99)00100-9
Wrapper verification, World Wide Web, vol.3, issue.2, pp.79-94, 2000. ,
DOI : 10.1023/A:1019229612909
Wrapper induction for information extraction, IJCAI, 1997. ,
Automatic generation of agents for collecting hidden Web pages for data extraction, Data & Knowledge Engineering, vol.49, issue.2, pp.177-196, 2004. ,
DOI : 10.1016/j.datak.2003.10.003
Using HMM to learn user browsing patterns for focused Web crawling, Data & Knowledge Engineering, vol.59, issue.2, pp.270-291, 2006. ,
DOI : 10.1016/j.datak.2006.01.012
A rule-based query language for HTML, DASFAA, 2001. ,
Coarse-grained classification of web sites by their structural properties, Proceedings of the eighth ACM international workshop on Web information and data management, WIDM '06, 2006. ,
DOI : 10.1145/1183550.1183559
Classifying web sites, Proceedings of the 16th international conference on World Wide Web, WWW '07, 2007. ,
DOI : 10.1145/1242572.1242736
IRLbot: Scaling to 6 billion pages and beyond, ACM Trans. Web, vol.38, issue.3, pp.1-8, 2009. ,
Wrapper maintenance: A machine learning approach, J. Artificial Intelligence Research, 2003. ,
RecipeCrawler: Collecting Recipe Data from WWW Incrementally, Advances in Web-Age Information Management, 2006. ,
DOI : 10.1007/11775300_23
An automated change-detection algorithm for HTML documents based on semantic hierarchies, ICDE, 2001. ,
[LNL04] Zehua Liu, Wee Keong Ng, and Ee-Peng Lim. An automated algorithm for extracting Website skeleton, DASFAA, 2004. ,
A Lightweight Algorithm for Automated Forum Information Processing, 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013. ,
DOI : 10.1109/WI-IAT.2013.18
Web archiving, 2006. ,
DOI : 10.1007/978-3-540-46332-0
Querying XML, 2006. ,
DOI : 10.1016/B978-155860711-8/50004-6
You've Got Dissent! Chinese Dissident Use of the Internet and Beijing's Counter Strategies, 2002. ,
Schema-guided wrapper maintenance for web-data extraction, Proceedings of the fifth ACM international workshop on Web information and data management, WIDM '03, 2003. ,
DOI : 10.1145/956699.956701
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.5673
Web-scale data integration: You can only afford to pay as you go, CIDR, 2007. ,
Google's Deep Web crawl, Proceedings of the VLDB Endowment, vol.1, issue.2, pp.1241-1252, 2008. ,
DOI : 10.14778/1454159.1454163
Introduction to Heritrix, an archival quality web crawler, Proceedings of the 4th International Web Archiving Workshop, 2004. ,
A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, AGENTS '99, 1999. ,
DOI : 10.1145/301136.301191
Evaluating topic-driven web crawlers, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, 2001. ,
DOI : 10.1145/383952.383995
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.9569
What's new on the web?, Proceedings of the 13th conference on World Wide Web, WWW '04, 2004. ,
DOI : 10.1145/988672.988674
High-Performance Web Crawling, Handbook of Massive Data Sets, pp.25-45, 2002. ,
DOI : 10.1002/1096-9128(200005)12:6<363::AID-CPE479>3.0.CO;2-3
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.2388
Reinventing Discovery: The New Era of Networked Science, 2011. ,
Quantitative analysis of user-generated content on the Web, WebEvolve, 2008. ,
An improved training algorithm for support vector machines, Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE Signal Processing Society Workshop, 1997. ,
DOI : 10.1109/NNSP.1997.622408
Web crawling, Found. Trends Inf. Retr., vol.4, issue.3, pp.175-246, 2010. ,
DOI : 10.1561/1500000017
Recrawl scheduling based on information longevity, Proceeding of the 17th international conference on World Wide Web, WWW '08, 2008. ,
DOI : 10.1145/1367497.1367557
Patrick Siehndel, and Yannis Stavrakas. An architecture for selective web harvesting: The use case of Heritrix, Proceedings of the 1st International Workshop on Archiving Community Memories, 2013. ,
WIC, VLDB, 2004. ,
DOI : 10.1016/B978-012088469-8.50034-6
H2RDF, Proceedings of the 21st international conference companion on World Wide Web, WWW '12 Companion, 2012. ,
DOI : 10.1145/2187980.2188058
URL : http://dspace.lib.ntua.gr/handle/123456789/36469
Yannis Stavrakas, and Pierre Senellart. Exploiting the social and semantic web for guided web archiving, Theory and Practice of Digital Libraries, pp.426-432, 2012. ,
HTML 4.01 specification, 1999. ,
WordPress completely dominates top 100 blogs, 2012. ,
Automatically maintaining wrappers for semi-structured web sources, Data & Knowledge Engineering, vol.61, issue.2, 2007. ,
DOI : 10.1016/j.datak.2006.06.006
Building light-weight wrappers for legacy Web data-sources using W4F, VLDB, 1999. ,
Data quality in web archiving, Proceedings of the 3rd workshop on Information credibility on the web, WICOW '09, 2009. ,
DOI : 10.1145/1526993.1526999
Declarative information extraction using Datalog with embedded extraction predicates, VLDB, 2007. ,
Exploring the Web with OXPath, LWDM, 2011. ,
Incremental crawling with Heritrix, IWAW, 2005. ,
Wraplet: Wrapping Your Web Contents with a Lightweight Language, 2007 Third International IEEE Conference on Signal-Image Technologies and Internet-Based System, 2007. ,
DOI : 10.1109/SITIS.2007.135
Learning information extraction rules for semistructured and free text, Machine Learning, pp.233-272, 1999. ,
Design and implementation of a high-performance distributed web crawler, Proceedings of the 18th International Conference on Data Engineering, pp.357-368, 2002. ,
On design of browser-oriented data extraction system and plug-ins, JMST, vol.18, 2010. ,
Focused crawling for both topical relevance and quality of medical information, Proceedings of the 14th ACM international conference on Information and knowledge management, CIKM '05, 2005. ,
DOI : 10.1145/1099554.1099583
Historical data not working, 2011. ,
GoGetIt!: a tool for generating structure-driven Web crawlers, WWW, 2006. ,
Structure-driven crawler generation by example, SIGIR, 2006. ,
XML Query (XQuery) Requirements. http://www.w3.org/TR/xquery-requirements, 2007. ,
Data-rich section extraction from HTML pages, WISE, 2002. ,
Data extraction and label assignment for web databases, Proceedings of the twelfth international conference on World Wide Web, WWW '03, 2003. ,
DOI : 10.1145/775152.775179
Exploring traversal strategy for web forum crawling, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, 2008. ,
DOI : 10.1145/1390334.1390413
Automatic wrappers generation and maintenance, PACLIC, 2011. ,
An enhanced intelligent forum crawler, 2012 IEEE Symposium on Computational Intelligence for Security and Defence Applications, 2012. ,
DOI : 10.1109/CISDA.2012.6291523
An ontology-based approach to learnable focused crawling, Information Sciences, vol.178, issue.23, pp.4512-4522, 2008. ,
DOI : 10.1016/j.ins.2008.07.030
Web data extraction based on partial tree alignment, Proceedings of the 14th international conference on World Wide Web, WWW '05, 2005. ,
DOI : 10.1145/1060745.1060761
Joint optimization of wrapper generation and template detection, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '07, 2007. ,
DOI : 10.1145/1281192.1281287