
Mesures de comparabilité pour la construction assistée de corpus comparables bilingues thématiques (Comparability measures for the assisted construction of thematic bilingual comparable corpora)

Guiyao Ke 1 
1 EXPRESSION - Expressiveness in Human Centered Data/Media
UBS - Université de Bretagne Sud, IRISA-D6 - MEDIA ET INTERACTIONS
Abstract : Thematic comparable corpora gather texts on the same topic written in several languages; the texts are highly similar but are not translations of one another. Compared with parallel corpora, which gather pairs of translations, comparable corpora have three advantages. First, they are rich and large resources, both in volume and in the period they cover. Second, they provide original language and thematic resources. Third, they are less expensive to develop than parallel corpora. With the considerable growth of the Web, abundant raw material is available for building comparable corpora. However, the quality of a comparable corpus is essential for its use in fields such as automatic or assisted translation, bilingual terminology extraction, and multilingual information retrieval. The objective of this thesis is to develop a methodological approach and a software toolkit that assist in building thematic bilingual comparable corpora from the Web, on demand. We first introduce the general concept of comparability, which maps two linguistic spaces, and then, starting from a reference quantitative comparability measure, we propose two variants that we call thematic comparability measures. We evaluate these quantitative measures with a protocol based on the gradual degradation of a parallel corpus. We then develop a new method to improve the co-clustering and co-classification of bilingual documents, as well as the alignment of comparable clusters. This approach merges the native similarities defined in each language space with the similarity induced by a comparability measure. Finally, we propose an integrated approach, based on the above contributions, to assist the construction from the Web of thematic bilingual comparable corpora of "good quality". This procedure includes a manual validation step to ensure the quality of the alignment of comparable clusters. By tuning the alignment comparability threshold, thematic comparable corpora with various comparability levels can be provided according to specified requirements. The experiments that we conducted on RSS feeds collected from major international newspapers appear relevant and promising.
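To make the evaluation idea concrete, here is a minimal sketch of a dictionary-based comparability score and of a degradation step applied to a toy parallel corpus. It is an illustration only and is not the measure or the protocol defined in the thesis: the scoring formula, the function names `comparability` and `degrade`, the toy dictionary, and the toy documents are all assumptions made for this example.

```python
def comparability(src_docs, tgt_docs, src_to_tgt):
    """Hypothetical score: share of dictionary-covered words, on both sides,
    that have at least one translation in the other corpus."""
    # Invert the bilingual dictionary to look up target words as well.
    tgt_to_src = {}
    for src_word, translations in src_to_tgt.items():
        for tgt_word in translations:
            tgt_to_src.setdefault(tgt_word, set()).add(src_word)

    src_vocab = {w for doc in src_docs for w in doc.split()}
    tgt_vocab = {w for doc in tgt_docs for w in doc.split()}

    # Only words covered by the dictionary can be judged.
    src_known = {w for w in src_vocab if w in src_to_tgt}
    tgt_known = {w for w in tgt_vocab if w in tgt_to_src}
    total = len(src_known) + len(tgt_known)
    if total == 0:
        return 0.0

    src_hits = sum(1 for w in src_known if src_to_tgt[w] & tgt_vocab)
    tgt_hits = sum(1 for w in tgt_known if tgt_to_src[w] & src_vocab)
    return (src_hits + tgt_hits) / total


def degrade(tgt_docs, noise_docs, n_replaced):
    """Replace the first n_replaced target documents with unrelated ones."""
    return noise_docs[:n_replaced] + tgt_docs[n_replaced:]


# Toy data (assumed for illustration): a tiny French-English dictionary,
# a two-document parallel corpus, and off-topic replacement documents.
DICTIONARY = {
    "chat": {"cat"}, "chien": {"dog"}, "maison": {"house"},
    "banque": {"bank"}, "bourse": {"exchange"},
}
FR = ["chat chien", "maison chat"]
EN = ["cat dog", "house cat"]
NOISE = ["bank exchange", "exchange bank"]

for n in range(len(EN) + 1):
    score = comparability(FR, degrade(EN, NOISE, n), DICTIONARY)
    print(f"{n} replaced document(s): comparability = {score:.3f}")
```

On this toy data the score falls as more target documents are replaced, which is the behaviour a degradation-based protocol would expect from a sound comparability measure.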

Contributor : Pierre-François Marteau
Submitted on : Monday, June 2, 2014 - 3:53:22 PM
Last modification on : Tuesday, October 19, 2021 - 11:58:58 PM
Long-term archiving on : Tuesday, September 2, 2014 - 10:45:36 AM


  • HAL Id : tel-00997837, version 1


Guiyao Ke. Mesures de comparabilité pour la construction assistée de corpus comparables bilingues thématiques. Traitement du texte et du document. Université de Bretagne Sud, 2014. Français. ⟨tel-00997837⟩


