which heuristically combines the topical relevance of pages, anchor texts, and link contexts. Whereas the authors consider a page relevant if its similarity to a few keywords is greater than zero (Hersovici et al., 1998)
Adaptive on-line page importance computation, Proceedings of the twelfth international conference on World Wide Web, WWW '03, pp.280-290, 2003.
DOI : 10.1145/775152.775192
Personalizing PageRank for word sense disambiguation, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pp.33-41, 2009.
DOI : 10.3115/1609067.1609070
University of Surrey participation in TREC 8: Weirdness indexing for logical document extrapolation and retrieval (WILTER), in The Eighth Text REtrieval Conference (TREC-8), 1999.
Metricc: Harnessing comparable corpora for multilingual lexicon development, Proceedings of the 15th EURALEX International Congress, pp.389-403, 2012.
URL : https://hal.archives-ouvertes.fr/halshs-00725224
Learning to aggregate vertical results into web search results, Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, 2011.
DOI : 10.1145/2063576.2063611
Sources of evidence for vertical selection, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.315-322, 2009.
DOI : 10.1145/1571941.1571997
Document selection methodologies for efficient and effective learning-to-rank, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.468-475, 2009.
DOI : 10.1145/1571941.1572022
Focused crawling with scalable ordinal regression solvers, Proceedings of the 24th international conference on Machine learning, ICML '07, pp.57-64, 2007.
DOI : 10.1145/1273496.1273504
Crawling a country, Special interest tracks and posters of the 14th international conference on World Wide Web, WWW '05, pp.864-872, 2005.
DOI : 10.1145/1062745.1062768
Modern information retrieval, 1999.
Scaling to very very large corpora for natural language disambiguation, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pp.26-33, 2001.
DOI : 10.3115/1073012.1073017
Random sampling from a search engine's index, Proceedings of the 15th international conference on the World Wide Web, p.367, 2006.
BootCaT: Bootstrapping Corpora and Terms from the Web, Proceedings of the LREC 2004 conference, pp.1313-1316, 2004.
The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, vol.10, issue.4, pp.209-226, 2009.
DOI : 10.1007/s10579-009-9081-4
Cleaneval: a competition for cleaning web pages, Proceedings of the Conference on Language Resources and Evaluation (LREC), 2008.
Building general- and special-purpose corpora by web crawling, Proceedings of the 13th NIJL international symposium, language corpora: Their compilation and application, pp.31-40, 2006.
MARVIN, multi-agent softbot to retrieve multilingual medical information on the Web, Medical Informatics, 1998.
Purely URL-based topic classification, Proceedings of the 18th international conference on World wide web, WWW '09, pp.1109-1110, 2009.
The deep web: Surfacing hidden value, The Journal of Electronic Publishing, 2001.
Focused crawls, tunneling, and digital libraries, Lecture Notes in Computer Science, p.26, 2002.
DOI : 10.1007/3-540-45747-x_7
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6604
A technique for measuring the relative size and overlap of public Web search engines, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.1-12, 1998.
DOI : 10.1016/S0169-7552(98)00127-5
Random walk term weighting for information retrieval, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pp.829-830, 2007.
DOI : 10.1145/1277741.1277930
Large language models in machine translation, Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp.858-867, 2007.
Keeping up with the changing Web, Computer, vol.33, issue.5, pp.52-58, 2000.
DOI : 10.1109/2.841784
Extracting patterns and relations from the world wide web, The World Wide Web and Databases, pp.172-183, 1999.
Identifying and Filtering Near-Duplicate Documents, Combinatorial Pattern Matching, pp.1-10, 2000.
DOI : 10.1007/3-540-45123-4_1
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.365.5357
Watson: Anticipating and contextualizing information needs, Proceedings of the Annual Meeting-American Society for Information Science, pp.727-740, 1999.
Crawling Towards Eternity: Building an Archive of the World Wide Web, 1997.
A survey of Web clustering engines, ACM Computing Surveys, vol.41, issue.3, pp.17-31, 2009.
DOI : 10.1145/1541880.1541884
Effective web crawling, PhD thesis, 2005.
DOI : 10.1145/1067268.1067287
Focused crawling: a new approach to topic-specific Web resource discovery, Computer Networks, vol.31, issue.11-16, pp.1623-1640, 1999.
DOI : 10.1016/S1389-1286(99)00052-3
Accelerated focused crawling through online relevance feedback, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.148-159, 2002.
DOI : 10.1145/511446.511466
Yahoo! learning to rank challenge overview, Journal of Machine Learning Research-Proceedings Track, vol.14, issue.107, pp.1-24, 2011.
Contextual information portals, Proceedings of AAAI Spring Symposium, 2010.
Parallel crawlers, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.124-135, 2002.
DOI : 10.1145/511446.511464
Effective page refresh policies for Web crawlers, ACM Transactions on Database Systems, vol.28, issue.4, pp.390-426, 2003.
DOI : 10.1145/958942.958945
Estimating frequency of change, ACM Transactions on Internet Technology, vol.3, issue.3, pp.256-290, 2003.
DOI : 10.1145/857166.857170
Efficient crawling through URL ordering, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.161-172, 1998.
DOI : 10.1016/S0169-7552(98)00108-1
Web page classification, Foundations and Advances in Data Mining, pp.221-274, 2005.
Overview of the TREC, 2009.
Optimal robot scheduling for Web search engines, Journal of Scheduling, vol.1, issue.1, 1997.
DOI : 10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K
URL : https://hal.archives-ouvertes.fr/inria-00073372
Predicting query performance, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '02, pp.299-306, 2002.
DOI : 10.1145/564376.564429
Information retrieval in the World-Wide Web: Making client-based searching feasible, Computer Networks and ISDN Systems, vol.27, issue.2, pp.183-192, 1994.
DOI : 10.1016/0169-7552(94)90132-5
Constitution automatique ou semi-automatique de lexiques thématiques en, 2010.
From federated to aggregated search, Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pp.910-910, 2010.
DOI : 10.1145/1835449.1835682
Focused crawling using context graphs, Proceedings of the 26th International Conference on Very Large Data Bases, pp.527-534, 2000.
Learning to rank with partially-labeled data, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pp.251-258, 2008.
DOI : 10.1145/1390334.1390379
Rank aggregation methods for the Web, Proceedings of the tenth international conference on World Wide Web, WWW '01, pp.613-622, 2001.
DOI : 10.1145/371920.372165
An adaptive model for optimizing performance of an incremental web crawler, Proceedings of the tenth international conference on World Wide Web, WWW '01, 2001.
DOI : 10.1145/371920.371960
A lightweight and efficient tool for cleaning web pages, Proceedings of the Sixth International Language Resources and Evaluation, Marrakech, Morocco, European Language Resources Association (ELRA), 2008.
LIBLINEAR: A library for large linear classification, The Journal of Machine Learning Research, vol.9, pp.1871-1874, 2008.
Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization, SIAM Journal on Scientific Computing, vol.27, issue.4, pp.1181-1201, 2006.
DOI : 10.1137/S1064827502412875
Overlaying graph links on treemaps, IEEE Symposium on Information Visualization Conference Compendium (demonstration), 2003.
URL : https://hal.archives-ouvertes.fr/hal-00875194
Introducing and evaluating ukWaC, a very large web-derived corpus of English, Proceedings of the 4th Web as Corpus Workshop (WAC-4), pp.47-54, 2008.
Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages, Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS, 2004.
Fact or fiction: Content classification for digital libraries, DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, p.80, 2001.
Concordancing the web with KWiCFinder, Third North American Symposium on Corpus Linguistics and Language Teaching, pp.1-16, 2001.
Making the Web More Useful as a Source for Linguistic Corpora, Corpus Linguistics in North America, pp.191-205, 2003.
DOI : 10.1163/9789004333772_011
Greedy function approximation: A gradient boosting machine, The Annals of Statistics, vol.29, issue.5, pp.1189-1232, 2001.
DOI : 10.1214/aos/1013203451
Bagging gradient-boosted trees for high precision, low variance ranking models, Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp.85-94, 2011.
Learning a monolingual language model from a multilingual text database, Proceedings of the ninth international conference on Information and knowledge management, CIKM '00, pp.187-193, 2000.
DOI : 10.1145/354756.354818
Building Minority Language Corpora by Learning to Generate Web Search Queries, Knowledge and Information Systems, vol.34, issue.1, pp.56-83, 2005.
DOI : 10.1023/A:1007545901558
Improving category specific Web search by learning query modifications, Proceedings 2001 Symposium on Applications and the Internet, pp.23-32, 2001.
DOI : 10.1109/SAINT.2001.905165
Guide focused crawler efficiently and effectively using on-line topical importance estimation, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, p.757, 2008.
DOI : 10.1145/1390334.1390488
Web spam taxonomy, First International Workshop on Adversarial Information Retrieval on the Web, pp.12-93, 2005.
The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, vol.24, issue.2, pp.8-12, 2009.
DOI : 10.1109/MIS.2009.36
Random-Walk Term Weighting for Improved Text Classification, Proceedings of the International Conference on Semantic Computing, pp.242-249, 2006.
Predicting the effectiveness of queries and retrieval systems, PhD thesis, pp.50-51, 2010.
Topic-sensitive PageRank, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.784-796, 2003.
DOI : 10.1145/511446.511513
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.5607
A study of the Dirichlet priors for term frequency normalisation, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pp.465-471, 2005.
DOI : 10.1145/1076034.1076114
Using coherence-based measures to predict query difficulty, Advances in Information Retrieval, pp.689-694, 2008.
Large margin rank boundaries for ordinal regression, Advances in Neural Information Processing Systems, pp.115-132, 1999.
The shark-search algorithm. An application: tailored Web site mapping, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.317-326, 1998.
DOI : 10.1016/S0169-7552(98)00038-5
Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems, vol.20, issue.4, pp.422-446, 2002.
DOI : 10.1145/582415.582418
Predicting query difficulty on the web by learning visual clues, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pp.615-616, 2005.
DOI : 10.1145/1076034.1076155
A ranking approach to keyphrase extraction, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.756-757, 2009.
DOI : 10.1145/1571941.1572113
URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.4470
Methods of automatic term recognition: A review, Terminology International Journal of Theoretical and Applied Issues in Specialized Communication, vol.3, issue.2, pp.259-289, 1996.
DOI : 10.1075/term.3.2.03kag
Nutch: A flexible and scalable open-source web search engine, pp.90-91, 2004.
Introduction to the Special Issue on the Web as Corpus, Computational Linguistics, vol.19, issue.1, pp.333-347, 2003.
DOI : 10.1038/21987
Evaluation of cross-language information retrieval using the domain-specific GIRT data as parallel German-English corpus, Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC, pp.1343-1346, 2004.
The domain-specific task of CLEF - specific evaluation strategies in cross-language information retrieval, Cross-Language Information Retrieval and Evaluation, pp.48-56, 2001.
A Computer Method for Calculating Kendall's Tau with Ungrouped Data, Journal of the American Statistical Association, vol.61, issue.314, pp.436-439, 1966.
DOI : 10.1080/01621459.1958.10501481
Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pp.441-450, 2010.
DOI : 10.1145/1718487.1718542
On Information and Sufficiency, The Annals of Mathematical Statistics, vol.22, issue.1, pp.79-86, 1951.
DOI : 10.1214/aoms/1177729694
The Web as a graph, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '00, pp.1-10, 2000.
DOI : 10.1145/335168.335170
Document language models, query models, and risk minimization for information retrieval, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pp.111-119, 2001.
DOI : 10.1145/383952.383970
A survey of eigenvector methods for web information retrieval, SIAM Review, pp.135-161, 2005.
Learning to Rank for Information Retrieval and Natural Language Processing, Synthesis Lectures on Human Language Technologies, vol.4, issue.1, 2011.
DOI : 10.2200/S00348ED1V01Y201104HLT012
Building text classifiers using positive and unlabeled examples, Third IEEE International Conference on Data Mining, pp.179-186, 2003.
DOI : 10.1109/ICDM.2003.1250918
Learning to Rank for Information Retrieval, Foundations and Trends® in Information Retrieval, vol.3, issue.3, pp.225-331, 2009.
DOI : 10.1561/1500000016
langid.py: An off-the-shelf language identification tool, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Demo Session, 2012.
Learning to model relatedness for news recommendation, Proceedings of the 20th international conference on World wide web, WWW '11, pp.57-66, 2011.
DOI : 10.1145/1963405.1963417
A random walks view of spectral segmentation, Artificial Intelligence and Statistics, 2001.
One-class SVMs for document classification, The Journal of Machine Learning Research, vol.2, pp.154-87, 2002.
Introduction to information retrieval, pp.11-41, 2008.
DOI : 10.1017/CBO9780511809071
Genres on the Web: Computational Models and Empirical Studies, 2010.
DOI : 10.1007/978-90-481-9178-9
Arachnid: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery, Proceedings of the Fourteenth International Conference on Machine Learning, pp.227-235, 1997.
Topical web crawlers, ACM Transactions on Internet Technology, vol.4, issue.4, pp.26-111, 2003.
DOI : 10.1145/1031114.1031117
Interactive focused crawler: Setup, monitoring and control through user feedback, Master's thesis, K.R. School of Information Technology, 2004.
Linear feature-based models for information retrieval, Information Retrieval, vol.10, issue.3, pp.257-274, 2007.
DOI : 10.1007/s10791-006-9019-z
TextRank: Bringing order into texts, Proceedings of EMNLP, pp.404-411, 2004.
Contextual search and name disambiguation in email using graphs, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '06, pp.27-34, 2006.
DOI : 10.1145/1148170.1148179
Workshop on aggregated search, ACM SIGIR Forum, pp.80-83, 2008.
DOI : 10.1145/1480506.1480520
Analysis of weighted networks, Physical Review E, vol.70, issue.5, pp.56131-54, 2004.
DOI : 10.1103/PhysRevE.70.056131
Object-level ranking, Proceedings of the 14th international conference on World Wide Web, WWW '05, pp.567-574, 2005.
DOI : 10.1145/1060745.1060828
Detecting spam web pages through content analysis, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.83-92, 2006.
DOI : 10.1145/1135777.1135794
Web Crawling, Foundations and Trends® in Information Retrieval, vol.4, issue.3, pp.175-246, 2010.
DOI : 10.1561/1500000017
Keyword spices: A new method for building domain-specific web search engines, International Joint Conference in Artificial Intelligence (IJCAI), pp.1457-1466, 2001.
The PageRank Citation Ranking: Bringing Order to the Web, p.55, 1999.
Crawl ordering by search impact, Proceedings of the international conference on Web search and web data mining, WSDM '08, pp.3-14, 2008.
DOI : 10.1145/1341531.1341535
Seeing stars, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pp.115-97, 2005.
DOI : 10.3115/1219840.1219855
MySpiders: Evolve your own intelligent web crawlers, Autonomous Agents and Multi-Agent Systems, pp.221-229, 2002.
Topical crawling for business intelligence, Research and Advanced Technology for Digital Libraries, pp.233-244, 2003.
Learning to crawl, ACM Transactions on Information Systems, vol.23, issue.4, pp.430-462, 2005.
DOI : 10.1145/1095872.1095875
Link contexts in classifier-guided topical crawlers, IEEE Transactions on Knowledge and Data Engineering, vol.18, issue.102, pp.107-122, 2006.
Removing Boilerplate and Duplicate Content from Web Corpora, PhD thesis, 2011.
An algorithm for suffix stripping, Program: electronic library and information systems, pp.211-218, 1980.
Building domain-specific web collections for scientific digital libraries, Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries, JCDL '04, pp.135-141, 2004.
DOI : 10.1145/996350.996383
LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval, vol.44, issue.2, pp.346-374, 2010.
DOI : 10.1007/s10791-009-9123-y
Quantify query ambiguity using ODP metadata, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pp.697-698, 2007.
DOI : 10.1145/1277741.1277864
Random walks for text semantic similarity, Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, TextGraphs-4, 2009.
DOI : 10.3115/1708124.1708131
Efficient web spidering with reinforcement learning, Proceedings of the 16th international conference on Machine Learning, pp.24-93, 1999.
WebCorp: an integrated system for web text search, Language and Computers, vol.59, issue.1, pp.47-67, 2006.
DOI : 10.1163/9789401203791_005
The intelligent surfer: Probabilistic combination of link and content information in PageRank, Advances in Neural Information Processing Systems, pp.1441-1448, 2002.
Beyond PageRank, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.707-715, 2006.
DOI : 10.1145/1135777.1135881
Information Retrieval, pp.2-42, 1979.
Learning dictionaries for information extraction by multi-level bootstrapping, Proceedings of the National Conference on Artificial Intelligence, pp.474-479, 1999.
Term-weighting approaches in automatic text retrieval, Information Processing & Management, pp.513-523, 1988.
Large scale learning to rank, NIPS 2009 Workshop on Advances in Ranking, 2009.
Machine learning in automated text categorization, ACM Computing Surveys, vol.34, issue.1, pp.1-47, 2002.
DOI : 10.1145/505282.505283
Active learning literature survey, Technical report, 2010.
Creating general-purpose corpora using automated search engine queries, WaCky! Working papers on the Web as Corpus, pp.63-98, 2006.
Federated Search, Foundations and Trends® in Information Retrieval, vol.5, issue.1, pp.1-102, 2011.
DOI : 10.1561/1500000010
Using query context models to construct topical search engines, Proceeding of the third symposium on Information interaction in context, IIiX '10, pp.75-84, 2010.
DOI : 10.1145/1840784.1840797
Identifying ambiguous queries in web search, Proceedings of the 16th international conference on World Wide Web, WWW '07, pp.1169-1170, 2007.
DOI : 10.1145/1242572.1242749
Document structure meets page layout, Proceedings of the 10th ACM symposium on Document engineering, DocEng '10, pp.151-160, 2010.
DOI : 10.1145/1860559.1860590
URL : https://hal.archives-ouvertes.fr/hal-00637719
A General Evaluation Framework for Topical Crawlers, Information Retrieval, vol.52, issue.3, pp.417-447, 2005.
DOI : 10.1007/s10791-005-6993-5
Techniques for specialized search engines, Proceedings of Internet Computing, pp.25-28, 2001.
Quality and relevance of domain-specific search: A case study in mental health, Information Retrieval, vol.87, issue.4, pp.207-225, 2006.
DOI : 10.1007/s10791-006-7150-5
Quality-oriented search for depression portals, Advances in Information Retrieval, pp.637-644, 2009.
Language-independent set expansion of named entities using the web, in Seventh IEEE International Conference on Data Mining, pp.342-350, 2007.
You can't beat frequency (unless you use linguistic knowledge), Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, ACL '06, pp.785-50, 2006.
DOI : 10.3115/1220175.1220274
Adapting boosting for information retrieval measures, Information Retrieval, vol.10, issue.3, pp.254-270, 2010.
DOI : 10.1007/s10791-009-9112-1
PEBL, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, pp.70-81, 2004.
DOI : 10.1145/775047.775083
Dual coordinate descent methods for logistic regression and maximum entropy models, Machine Learning, pp.41-75, 2011.
DOI : 10.1007/s10994-010-5221-8