
Découverte et caractérisation des corpus comparables spécialisés (Discovery and characterization of specialized comparable corpora)

Abstract: Comparable corpora are sets of texts written in different languages that are not translations of each other but that share common characteristics. Their main advantage is that they are fully representative of the linguistic and cultural specificities of their respective languages. The Web can in theory be regarded as a source of comparable corpora. However, the quality of the corpora and of the resources extracted from them depends on how the corpora are defined beforehand and on the care taken in compiling them (i.e. the definition of the features the comparable corpora share). In this thesis, we focus on the compilation of specialized comparable corpora in French and Japanese whose documents are extracted from the Web. We propose a definition of these corpora and a set of common features: a specialized domain, a topic and a type of discourse (science or popular science). Our goal is to create a tool that assists the compilation of comparable corpora. First, we present the automatic recognition of the common features. Topics can be easily identified from the keywords used in Web searches. In contrast, detecting the type of discourse requires a broad stylistic analysis. This analysis is carried out on a training corpus and leads to a bilingual typology based on three levels of analysis: structural, modal and lexical. Second, we use this typology to learn a classification model with SVMlight and C4.5. This classification model is tested on an evaluation corpus. Our test results indicate that more than 70% of the documents are correctly classified. Finally, the classifier is integrated into a tool that assists the compilation of comparable corpora, built on the UIMA framework.

Cited literature: 111 references
Contributor: Lorraine Goeuriot
Submitted on: Tuesday, April 20, 2010 - 6:25:51 AM
Last modification on: Monday, October 19, 2020 - 11:08:54 AM
Long-term archiving on: Tuesday, September 14, 2010 - 4:12:47 PM


  • HAL Id: tel-00474405, version 1



Lorraine Goeuriot. Découverte et caractérisation des corpus comparables spécialisés. Human-Computer Interaction [cs.HC]. Université de Nantes, 2009. French. ⟨tel-00474405⟩


