A variant of the shark-search strategy of Hersovici et al., which heuristically integrates the topical relevance of pages, anchor texts, and link contexts. Whereas the authors judge a page relevant if its similarity to a few keywords is greater than zero (Hersovici et al., 1998).

S. Abiteboul, M. Preda, and G. Cobena, Adaptive on-line page importance computation, Proceedings of the twelfth international conference on World Wide Web, WWW '03, pp.280-290, 2003.
DOI : 10.1145/775152.775192

E. Agirre and A. Soroa, Personalizing PageRank for word sense disambiguation, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL '09, pp.33-41, 2009.
DOI : 10.3115/1609067.1609070

K. Ahmad, L. Gillam, and L. Tostevin, University of Surrey participation in TREC 8: Weirdness indexing for logical document extrapolation and retrieval (WILDER), In The Eighth Text REtrieval Conference (TREC-8), 1999.

A. Alonso, H. Blancafort, C. de Groc, C. Million, and G. Williams, Metricc: Harnessing comparable corpora for multilingual lexicon development, Proceedings of the 15th EURALEX International Congress, pp.389-403, 2012.
URL : https://hal.archives-ouvertes.fr/halshs-00725224

J. Arguello, F. Diaz, and J. Callan, Learning to aggregate vertical results into web search results, Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM '11, 2011.
DOI : 10.1145/2063576.2063611

J. Arguello, F. Diaz, J. Callan, and J. Crespo, Sources of evidence for vertical selection, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.315-322, 2009.
DOI : 10.1145/1571941.1571997

J. Aslam, E. Kanoulas, V. Pavlu, S. Savev, and E. Yilmaz, Document selection methodologies for efficient and effective learning-to-rank, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.468-475, 2009.
DOI : 10.1145/1571941.1572022

R. Babaria, J. Nath, C. Bhattacharyya, and M. Murty, Focused crawling with scalable ordinal regression solvers, Proceedings of the 24th international conference on Machine learning, ICML '07, pp.57-64, 2007.
DOI : 10.1145/1273496.1273504

R. Baeza-Yates, C. Castillo, M. Marin, and A. Rodriguez, Crawling a country, Special interest tracks and posters of the 14th international conference on World Wide Web, WWW '05, pp.864-872, 2005.
DOI : 10.1145/1062745.1062768

R. Baeza-Yates and B. Ribeiro-Neto, Modern information retrieval, 1999.

M. Banko and E. Brill, Scaling to very very large corpora for natural language disambiguation, Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, ACL '01, pp.26-33, 2001.
DOI : 10.3115/1073012.1073017

Z. Bar-Yossef and M. Gurevich, Random sampling from a search engine's index, Proceedings of the 15th international conference on the World Wide Web, p.367, 2006.

M. Baroni and S. Bernardini, BootCaT: Bootstrapping Corpora and Terms from the Web, Proceedings of the LREC 2004 conference, pp.1313-1316, 2004.

M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, vol.43, issue.3, pp.209-226, 2009.
DOI : 10.1007/s10579-009-9081-4

M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, Cleaneval: a competition for cleaning web pages, Proceedings of the Conference on Language Resources and Evaluation (LREC), 2008.

M. Baroni and M. Ueyama, Building general- and special-purpose corpora by web crawling, Proceedings of the 13th NIJL international symposium, language corpora: Their compilation and application, pp.31-40, 2006.

O. Baujard, V. Baujard, S. Aurel, C. Boyer, and R. Appel, MARVIN, multi-agent softbot to retrieve multilingual medical information on the Web, Medical Informatics, 1998.

E. Baykan, M. Henzinger, L. Marian, and I. Weber, Purely URL-based topic classification, Proceedings of the 18th international conference on World wide web, WWW '09, pp.1109-1110, 2009.

M. Bergman, The deep web: Surfacing hidden value, The Journal of Electronic Publishing, 2001.

D. Bergmark, C. Lagoze, and A. Sbityakov, Focused crawls, tunneling, and digital libraries, Lecture Notes in Computer Science, p.26, 2002.
DOI : 10.1007/3-540-45747-x_7

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6604

K. Bharat and A. Broder, A technique for measuring the relative size and overlap of public Web search engines, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.1-12, 1998.
DOI : 10.1016/S0169-7552(98)00127-5

R. Blanco and C. Lioma, Random walk term weighting for information retrieval, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pp.829-830, 2007.
DOI : 10.1145/1277741.1277930

T. Brants, A. Popat, P. Xu, F. Och, and J. Dean, Large language models in machine translation, Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp.858-867, 2007.

B. Brewington and G. Cybenko, Keeping up with the changing Web, Computer, vol.33, issue.5, pp.52-58, 2000.
DOI : 10.1109/2.841784

S. Brin, Extracting patterns and relations from the world wide web, The World Wide Web and Databases, pp.172-183, 1999.

A. Broder, Identifying and Filtering Near-Duplicate Documents, Combinatorial Pattern Matching, pp.1-10, 2000.
DOI : 10.1007/3-540-45123-4_1

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.365.5357

J. Budzik and K. Hammond, Watson: Anticipating and contextualizing information needs, Proceedings of the Annual Meeting-American Society for Information Science, pp.727-740, 1999.

M. Burner, Crawling Towards Eternity: Building an Archive of the World Wide Web, 1997.

C. Carpineto, S. Osiński, G. Romano, and D. Weiss, A survey of Web clustering engines, ACM Computing Surveys, vol.41, issue.3, pp.17-31, 2009.
DOI : 10.1145/1541880.1541884

C. Castillo, Effective web crawling, PhD thesis, 2005.
DOI : 10.1145/1067268.1067287

S. Chakrabarti, M. van den Berg, and B. Dom, Focused crawling: a new approach to topic-specific Web resource discovery, Computer Networks, vol.31, issue.11-16, pp.1623-1640, 1999.
DOI : 10.1016/S1389-1286(99)00052-3

S. Chakrabarti, K. Punera, and M. Subramanyam, Accelerated focused crawling through online relevance feedback, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.148-159, 2002.
DOI : 10.1145/511446.511466

O. Chapelle and Y. Chang, Yahoo! learning to rank challenge overview, Journal of Machine Learning Research-Proceedings Track, vol.14, pp.1-24, 2011.

J. Chen, T. Karthik, and L. Subramanian, Contextual information portals, Proceedings of AAAI Spring Symposium, 2010.

J. Cho and H. Garcia-Molina, Parallel crawlers, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.124-135, 2002.
DOI : 10.1145/511446.511464

J. Cho and H. Garcia-Molina, Effective page refresh policies for Web crawlers, ACM Transactions on Database Systems, vol.28, issue.4, pp.390-426, 2003.
DOI : 10.1145/958942.958945

J. Cho and H. Garcia-Molina, Estimating frequency of change, ACM Transactions on Internet Technology, vol.3, issue.3, pp.256-290, 2003.
DOI : 10.1145/857166.857170

J. Cho, H. Garcia-Molina, and L. Page, Efficient crawling through URL ordering, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.161-172, 1998.
DOI : 10.1016/S0169-7552(98)00108-1

B. Choi and Z. Yao, Web page classification, Foundations and Advances in Data Mining, pp.221-274, 2005.

C. L. Clarke, N. Craswell, and I. Soboroff, Overview of the TREC, 2009.

E. Coffman, Z. Liu, and R. Weber, Optimal robot scheduling for Web search engines, Journal of Scheduling, vol.1, issue.1, 1997.
DOI : 10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K

URL : https://hal.archives-ouvertes.fr/inria-00073372

S. Cronen-Townsend, Y. Zhou, and W. Croft, Predicting query performance, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '02, pp.299-306, 2002.
DOI : 10.1145/564376.564429

P. De Bra and R. Post, Information retrieval in the World-Wide Web: Making client-based searching feasible, Computer Networks and ISDN Systems, vol.27, issue.2, pp.183-192, 1994.
DOI : 10.1016/0169-7552(94)90132-5

C. de Groc, Constitution automatique ou semi-automatique de lexiques thématiques en, 2010.

F. Diaz, M. Lalmas, and M. Shokouhi, From federated to aggregated search, Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '10, pp.910-910, 2010.
DOI : 10.1145/1835449.1835682

M. Diligenti, F. Coetzee, S. Lawrence, C. Giles, and M. Gori, Focused crawling using context graphs, Proceedings of the 26th International Conference on Very Large Data Bases, pp.527-534, 2000.

K. Duh and K. Kirchhoff, Learning to rank with partially-labeled data, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, pp.251-258, 2008.
DOI : 10.1145/1390334.1390379

C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, Rank aggregation methods for the Web, Proceedings of the tenth international conference on World Wide Web, WWW '01, pp.613-622, 2001.
DOI : 10.1145/371920.372165

J. Edwards, K. McCurley, and J. Tomlin, An adaptive model for optimizing performance of an incremental web crawler, Proceedings of the tenth international conference on World Wide Web, WWW '01, 2001.
DOI : 10.1145/371920.371960

S. Evert, A lightweight and efficient tool for cleaning web pages, Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC), Marrakech, Morocco, European Language Resources Association (ELRA), 2008.

R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, Liblinear: A library for large linear classification, The Journal of Machine Learning Research, vol.9, pp.1871-1874, 2008.

A. Farahat, T. Lofaro, J. Miller, G. Rae, and L. Ward, Authority Rankings from HITS, PageRank, and SALSA: Existence, Uniqueness, and Effect of Initialization, SIAM Journal on Scientific Computing, vol.27, issue.4, pp.1181-1201, 2006.
DOI : 10.1137/S1064827502412875

J. Fekete, D. Wang, N. Dang, A. Aris, and C. Plaisant, Overlaying graph links on treemaps, IEEE Symposium on Information Visualization Conference Compendium (demonstration), 2003.
URL : https://hal.archives-ouvertes.fr/hal-00875194

A. Ferraresi, E. Zanchetta, M. Baroni, and S. Bernardini, Introducing and evaluating ukWaC, a very large web-derived corpus of English, Proceedings of the 4th Web as Corpus Workshop (WAC-4), pp.47-54, 2008.

D. Fetterly, M. Manasse, and M. Najork, Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages, Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS, 2004.

A. Finn, N. Kushmerick, and B. Smyth, Fact or fiction: Content classification for digital libraries, DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, p.80, 2001.

W. Fletcher, Concordancing the web with kwicfinder, Third North American Symposium on Corpus Linguistics and Language Teaching, pp.1-16, 2001.

W. H. Fletcher, Making the Web More Useful as a Source for Linguistic Corpora, Corpus Linguistics in North America, pp.191-205, 2003.
DOI : 10.1163/9789004333772_011

J. H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics, vol.29, issue.5, pp.1189-1232, 2001.
DOI : 10.1214/aos/1013203451

Y. Ganjisaffar, R. Caruana, and C. V. Lopes, Bagging gradient-boosted trees for high precision, low variance ranking models, Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pp.85-94, 2011.

R. Ghani and R. Jones, Learning a monolingual language model from a multilingual text database, Proceedings of the ninth international conference on Information and knowledge management, CIKM '00, pp.187-193, 2000.
DOI : 10.1145/354756.354818

R. Ghani, R. Jones, and D. Mladenic, Building Minority Language Corpora by Learning to Generate Web Search Queries, Knowledge and Information Systems, vol.34, issue.1, pp.56-83, 2005.
DOI : 10.1023/A:1007545901558

E. Glover, G. Flake, S. Lawrence, W. Birmingham, A. Kruger et al., Improving category specific Web search by learning query modifications, Proceedings 2001 Symposium on Applications and the Internet, pp.23-32, 2001.
DOI : 10.1109/SAINT.2001.905165

Z. Guan, C. Wang, C. Chen, J. Bu, and J. Wang, Guide focused crawler efficiently and effectively using on-line topical importance estimation, Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '08, p.757, 2008.
DOI : 10.1145/1390334.1390488

Z. Gyöngyi and H. Garcia-Molina, Web spam taxonomy, First International Workshop on Adversarial Information Retrieval on the Web, pp.12-93, 2005.

A. Halevy, P. Norvig, and F. Pereira, The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, vol.24, issue.2, pp.8-12, 2009.
DOI : 10.1109/MIS.2009.36

S. Hassan, R. Mihalcea, and C. Banea, Random-Walk Term Weighting for Improved Text Classification, Proceedings of the International Conference on Semantic Computing, pp.242-249, 2006.

C. Hauff, Predicting the effectiveness of queries and retrieval systems, PhD thesis, pp.50-51, 2010.

T. Haveliwala, Topic-sensitive PageRank, Proceedings of the eleventh international conference on World Wide Web, WWW '02, pp.784-796, 2003.
DOI : 10.1145/511446.511513

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.5607

B. He and I. Ounis, A study of the Dirichlet priors for term frequency normalisation, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pp.465-471, 2005.
DOI : 10.1145/1076034.1076114

J. He, M. Larson, and M. de Rijke, Using coherence-based measures to predict query difficulty, Advances in Information Retrieval, pp.689-694, 2008.

R. Herbrich, T. Graepel, and K. Obermayer, Large margin rank boundaries for ordinal regression, Advances in Neural Information Processing Systems, pp.115-132, 1999.

M. Hersovici, M. Jacovi, Y. Maarek, D. Pelleg, M. Shtalhaim et al., The shark-search algorithm. An application: tailored Web site mapping, Computer Networks and ISDN Systems, vol.30, issue.1-7, pp.317-326, 1998.
DOI : 10.1016/S0169-7552(98)00038-5

K. Järvelin and J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems, vol.20, issue.4, pp.422-446, 2002.
DOI : 10.1145/582415.582418

E. Jensen, S. Beitzel, D. Grossman, O. Frieder, and A. Chowdhury, Predicting query difficulty on the web by learning visual clues, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05, pp.615-616, 2005.
DOI : 10.1145/1076034.1076155

X. Jiang, Y. Hu, and H. Li, A ranking approach to keyphrase extraction, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, SIGIR '09, pp.756-757, 2009.
DOI : 10.1145/1571941.1572113

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.159.4470

K. Kageura and B. Umino, Methods of automatic term recognition: A review, Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, vol.3, issue.2, pp.259-289, 1996.
DOI : 10.1075/term.3.2.03kag

R. Khare, D. Cutting, K. Sitaker, and A. Rifkin, Nutch: A flexible and scalable open-source web search engine, pp.90-91, 2004.

A. Kilgarriff and G. Grefenstette, Introduction to the Special Issue on the Web as Corpus, Computational Linguistics, vol.29, issue.3, pp.333-347, 2003.
DOI : 10.1162/089120103322711569

M. Kluck, Evaluation of cross-language information retrieval using the domain-specific girt data as parallel german-english corpus, Proceedings of the Fourth International Conference on Language Resources and Evaluation, LREC, pp.1343-1346, 2004.

M. Kluck and F. Gey, The domain-specific task of CLEF - specific evaluation strategies in cross-language information retrieval, Cross-Language Information Retrieval and Evaluation, pp.48-56, 2001.

W. R. Knight, A Computer Method for Calculating Kendall's Tau with Ungrouped Data, Journal of the American Statistical Association, vol.61, issue.314, pp.436-439, 1966.
DOI : 10.1080/01621459.1958.10501481

C. Kohlschütter, P. Fankhauser, and W. Nejdl, Boilerplate detection using shallow text features, Proceedings of the third ACM international conference on Web search and data mining, WSDM '10, pp.441-450, 2010.
DOI : 10.1145/1718487.1718542

S. Kullback and R. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics, vol.22, issue.1, pp.79-86, 1951.
DOI : 10.1214/aoms/1177729694

R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tompkins et al., The Web as a graph, Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '00, pp.1-10, 2000.
DOI : 10.1145/335168.335170

J. Lafferty and C. Zhai, Document language models, query models, and risk minimization for information retrieval, Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '01, pp.111-119, 2001.
DOI : 10.1145/383952.383970

A. Langville and C. Meyer, A survey of eigenvector methods for web information retrieval, SIAM Review, pp.135-161, 2005.

H. Li, Learning to Rank for Information Retrieval and Natural Language Processing, Synthesis Lectures on Human Language Technologies, vol.4, issue.1, 2011.
DOI : 10.2200/S00348ED1V01Y201104HLT012

B. Liu, Y. Dai, X. Li, W. Lee, and P. Yu, Building text classifiers using positive and unlabeled examples, Third IEEE International Conference on Data Mining, pp.179-186, 2003.
DOI : 10.1109/ICDM.2003.1250918

T. Liu, Learning to Rank for Information Retrieval, Foundations and Trends® in Information Retrieval, vol.3, issue.3, pp.225-331, 2009.
DOI : 10.1561/1500000016

M. Lui and T. Baldwin, langid.py: An off-the-shelf language identification tool, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics Demo Session, 2012.

Y. Lv, T. Moon, P. Kolari, Z. Zheng, X. Wang et al., Learning to model relatedness for news recommendation, Proceedings of the 20th international conference on World wide web, WWW '11, pp.57-66, 2011.
DOI : 10.1145/1963405.1963417

M. Meila and J. Shi, A random walks view of spectral segmentation, Artificial Intelligence and Statistics, 2001.

L. Manevitz and M. Yousef, One-class SVMs for document classification, The Journal of Machine Learning Research, vol.2, pp.139-154, 2002.

C. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval, pp.11-41, 2008.
DOI : 10.1017/CBO9780511809071

A. Mehler, S. Sharoff, and M. Santini, Genres on the Web: Computational Models and Empirical Studies, 2010.
DOI : 10.1007/978-90-481-9178-9

F. Menczer, Arachnid: Adaptive retrieval agents choosing heuristic neighborhoods for information discovery, Proceedings of the Fourteenth International Conference on Machine Learning, pp.227-235, 1997.

F. Menczer, G. Pant, and P. Srinivasan, Topical web crawlers, ACM Transactions on Internet Technology, vol.4, issue.4, pp.26-111, 2003.
DOI : 10.1145/1031114.1031117

R. Menezes, Interactive focused crawler: Setup, monitoring and control through user feedback, Master's thesis, K.R. School of Information Technology, 2004.

D. Metzler and W. B. Croft, Linear feature-based models for information retrieval, Information Retrieval, vol.10, issue.3, pp.257-274, 2007.
DOI : 10.1007/s10791-006-9019-z

R. Mihalcea and P. Tarau, TextRank: Bringing order into texts, Proceedings of EMNLP, pp.404-411, 2004.

E. Minkov, W. Cohen, and A. Ng, Contextual search and name disambiguation in email using graphs, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '06, pp.27-34, 2006.
DOI : 10.1145/1148170.1148179

V. Murdock and M. Lalmas, Workshop on aggregated search, ACM SIGIR Forum, pp.80-83, 2008.
DOI : 10.1145/1480506.1480520

M. Newman, Analysis of weighted networks, Physical Review E, vol.70, issue.5, p.056131, 2004.
DOI : 10.1103/PhysRevE.70.056131

Z. Nie, Y. Zhang, J. Wen, and W. Ma, Object-level ranking, Proceedings of the 14th international conference on World Wide Web, WWW '05, pp.567-574, 2005.
DOI : 10.1145/1060745.1060828

A. Ntoulas, M. Najork, and M. Manasse, Detecting spam web pages through content analysis, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.83-92, 2006.
DOI : 10.1145/1135777.1135794

C. Olston and M. Najork, Web Crawling, Foundations and Trends® in Information Retrieval, vol.4, issue.3, pp.175-246, 2010.
DOI : 10.1561/1500000017

S. Oyama, T. Kokubo, T. Ishida, T. Yamada, and Y. Kitamura, Keyword spices: A new method for building domain-specific web search engines, International Joint Conference in Artificial Intelligence (IJCAI), pp.1457-1466, 2001.

L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, p.55, 1999.

S. Pandey and C. Olston, Crawl ordering by search impact, Proceedings of the international conference on Web search and web data mining, WSDM '08, pp.3-14, 2008.
DOI : 10.1145/1341531.1341535

B. Pang and L. Lee, Seeing stars, Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pp.115-124, 2005.
DOI : 10.3115/1219840.1219855

G. Pant and F. Menczer, MySpiders: Evolve your own intelligent web crawlers, Autonomous Agents and Multi-Agent Systems, pp.221-229, 2002.

G. Pant and F. Menczer, Topical crawling for business intelligence, Research and Advanced Technology for Digital Libraries, pp.233-244, 2003.

G. Pant and P. Srinivasan, Learning to crawl, ACM Transactions on Information Systems, vol.23, issue.4, pp.430-462, 2005.
DOI : 10.1145/1095872.1095875

G. Pant and P. Srinivasan, Link contexts in classifier-guided topical crawlers, IEEE Transactions on Knowledge and Data Engineering, vol.18, issue.1, pp.107-122, 2006.

J. Pomikálek, Removing Boilerplate and Duplicate Content from Web Corpora, PhD thesis, 2011.

M. Porter, An algorithm for suffix stripping, Program: electronic library and information systems, pp.211-218, 1980.

J. Qin, Y. Zhou, and M. Chau, Building domain-specific web collections for scientific digital libraries, Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries, JCDL '04, pp.135-141, 2004.
DOI : 10.1145/996350.996383

T. Qin, T. Liu, J. Xu, and H. Li, LETOR: A benchmark collection for research on learning to rank for information retrieval, Information Retrieval, vol.13, issue.4, pp.346-374, 2010.
DOI : 10.1007/s10791-009-9123-y

G. Qiu, K. Liu, J. Bu, C. Chen, and Z. Kang, Quantify query ambiguity using ODP metadata, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pp.697-698, 2007.
DOI : 10.1145/1277741.1277864

D. Ramage, A. Rafferty, and C. Manning, Random walks for text semantic similarity, Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, TextGraphs-4, 2009.
DOI : 10.3115/1708124.1708131

J. Rennie and A. McCallum, Efficient web spidering with reinforcement learning, Proceedings of the 16th international conference on Machine Learning, pp.24-93, 1999.

A. Renouf, A. Kehoe, and J. Banerjee, WebCorp: an integrated system for web text search, Language and Computers, vol.59, issue.1, pp.47-67, 2006.
DOI : 10.1163/9789401203791_005

M. Richardson and P. Domingos, The intelligent surfer: Probabilistic combination of link and content information in PageRank, Advances in Neural Information Processing Systems, pp.1441-1448, 2002.

M. Richardson, A. Prakash, and E. Brill, Beyond PageRank, Proceedings of the 15th international conference on World Wide Web, WWW '06, pp.707-715, 2006.
DOI : 10.1145/1135777.1135881

C. J. van Rijsbergen, Information Retrieval, pp.2-42, 1979.

E. Riloff and R. Jones, Learning dictionaries for information extraction by multi-level bootstrapping, Proceedings of the National Conference on Artificial Intelligence, pp.474-479, 1999.

G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management, pp.513-523, 1988.

D. Sculley, Large scale learning to rank, NIPS 2009 Workshop on Advances in Ranking, 2009.

F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys, vol.34, issue.1, pp.1-47, 2002.
DOI : 10.1145/505282.505283

B. Settles, Active learning literature survey, Technical report, 2010.

S. Sharoff, Creating general-purpose corpora using automated search engine queries, WaCky! Working papers on the Web as Corpus, pp.63-98, 2006.

M. Shokouhi and L. Si, Federated Search, Foundations and Trends® in Information Retrieval, vol.5, issue.1, pp.1-102, 2011.
DOI : 10.1561/1500000010

P. Sondhi, R. Chandrasekar, and R. Rounthwaite, Using query context models to construct topical search engines, Proceeding of the third symposium on Information interaction in context, IIiX '10, pp.75-84, 2010.
DOI : 10.1145/1840784.1840797

R. Song, Z. Luo, J. Wen, Y. Yu, and H. Hon, Identifying ambiguous queries in web search, Proceedings of the 16th international conference on World Wide Web, WWW '07, pp.1169-1170, 2007.
DOI : 10.1145/1242572.1242749

A. Spengler and P. Gallinari, Document structure meets page layout, Proceedings of the 10th ACM symposium on Document engineering, DocEng '10, pp.151-160, 2010.
DOI : 10.1145/1860559.1860590

URL : https://hal.archives-ouvertes.fr/hal-00637719

P. Srinivasan, F. Menczer, and G. Pant, A General Evaluation Framework for Topical Crawlers, Information Retrieval, vol.8, issue.3, pp.417-447, 2005.
DOI : 10.1007/s10791-005-6993-5

R. Steele, Techniques for specialized search engines, Proceedings of Internet Computing, pp.25-28, 2001.

T. Tang, N. Craswell, D. Hawking, K. Griffiths, and H. Christensen, Quality and relevance of domain-specific search: A case study in mental health, Information Retrieval, vol.87, issue.4, pp.207-225, 2006.
DOI : 10.1007/s10791-006-7150-5

T. Tang, D. Hawking, R. Sankaranarayana, K. Griffiths, and N. Craswell, Quality-oriented search for depression portals, Advances in Information Retrieval, pp.637-644, 2009.

R. Wang and W. Cohen, Language-independent set expansion of named entities using the web, In Seventh IEEE International Conference on Data Mining, pp.342-350, 2007.

J. Wermter and U. Hahn, You can't beat frequency (unless you use linguistic knowledge), Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, ACL '06, pp.785-792, 2006.
DOI : 10.3115/1220175.1220274

Q. Wu, C. J. Burges, K. M. Svore, and J. Gao, Adapting boosting for information retrieval measures, Information Retrieval, vol.10, issue.3, pp.254-270, 2010.
DOI : 10.1007/s10791-009-9112-1

H. Yu, J. Han, and K. Chang, PEBL, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, pp.70-81, 2004.
DOI : 10.1145/775047.775083

H. Yu, F. Huang, and C. Lin, Dual coordinate descent methods for logistic regression and maximum entropy models, Machine Learning, pp.41-75, 2011.
DOI : 10.1007/s10994-010-5221-8