Extraction d'information à partir de documents Web multilingues : une approche d'analyses structurelles

Abstract : Multilingual Web Document (MWD) processing has become one of the major interests of research and development in the area of information retrieval. However, we observed that the structure of multilingual resources has not been sufficiently explored in most research in this area. We consider that link structure embeds crucial information for both hyperdocument retrieval and mining. In this context, we recall that each Web site is considered a hyperdocument containing a set of Web documents (pages, screens, messages) that can be explored by following link paths. Detecting the dominant languages in a Web site can therefore be done in different ways. The framework of this experimental thesis is structural analysis for information extraction from a large number of heterogeneous structured or semi-structured electronic documents (essentially Web documents). It covers the following aspects: enumerating the dominant languages, setting up (virtual) frontiers between those languages, enabling further processing, and recognizing the dominant languages.
Document type :
Theses

Cited literature : 129 references

https://tel.archives-ouvertes.fr/tel-00258948
Contributor : Hal System
Submitted on : Tuesday, February 26, 2008 - 10:31:03 AM
Last modification on : Tuesday, February 5, 2019 - 12:12:10 PM
Long-term archiving on : Thursday, May 20, 2010 - 6:39:07 PM

Identifiers

  • HAL Id : tel-00258948, version 1

Citation

Tuan Dang Nguyen. Extraction d'information à partir de documents Web multilingues : une approche d'analyses structurelles. Other [cs.OH]. Université de Caen, 2006. French. ⟨tel-00258948⟩

Metrics

Record views : 491

File downloads : 2152