Parallel corpus extraction for machine translation from and to an under-resourced language

Abstract : Machine translation now achieves good results for several language pairs, such as English – French, English – Chinese, and English – Spanish. Empirical translation, particularly statistical machine translation, makes it possible to build a translation system quickly when adequate data is available, because statistical machine translation relies on models trained on large parallel bilingual corpora in the source and target languages. However, research on machine translation for under-resourced language pairs consistently faces a lack of training data. We therefore address the problem of obtaining a large parallel bilingual text corpus with which to build a statistical machine translation system. The originality of our work lies in its focus on under-resourced languages, for which parallel bilingual corpora do not exist in most cases. This manuscript presents our methodology for extracting a parallel corpus from a comparable corpus, a richer and more diverse data resource available on the Web. We propose three extraction methods. The first follows the classical approach, using general characteristics of the documents as well as their lexical information to retrieve both parallel documents and parallel sentence pairs. However, this method requires additional data for the language pair. The second is a completely unsupervised method that requires no additional data and can be applied to any language pair, including under-resourced ones. The third extends the second method by using a third language to improve the extraction process (triangulation). The proposed methods are validated by a number of experiments on Vietnamese, an under-resourced language, paired with English and French.
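To make the idea behind the first method more concrete, the following is a minimal sketch of sentence-pair filtering over a comparable corpus. It is not the thesis's implementation: the function names, the length-ratio limit, the overlap threshold, and the toy English-to-French lexicon are all assumptions chosen for illustration. The abstract says only that the method uses general document characteristics and lexical information, and that it needs additional data for the language pair. A bilingual lexicon stands in for that additional data here.

```python
# Illustrative sketch only: all names, thresholds, and data are hypothetical.

def length_ratio_ok(src_tokens, tgt_tokens, max_ratio=2.0):
    """General-characteristic filter: reject pairs with very different lengths."""
    if not src_tokens or not tgt_tokens:
        return False
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return longer / shorter <= max_ratio


def lexical_overlap(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens with at least one lexicon translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    covered = sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)
    return covered / len(src_tokens)


def extract_pairs(src_sentences, tgt_sentences, lexicon, threshold=0.5):
    """Keep, for each source sentence, its best-scoring target above the threshold."""
    pairs = []
    for src in src_sentences:
        src_tokens = src.lower().split()
        best, best_score = None, 0.0
        for tgt in tgt_sentences:
            tgt_tokens = tgt.lower().split()
            if not length_ratio_ok(src_tokens, tgt_tokens):
                continue
            score = lexical_overlap(src_tokens, tgt_tokens, lexicon)
            if score > best_score:
                best, best_score = tgt, score
        if best is not None and best_score >= threshold:
            pairs.append((src, best, best_score))
    return pairs


# Toy data (assumed, not taken from the thesis).
toy_lexicon = {
    "the": {"le", "la"},
    "cat": {"chat"},
    "sleeps": {"dort"},
    "dog": {"chien"},
}
source_side = ["the cat sleeps", "the dog barks loudly today"]
target_side = ["le chat dort", "il pleut"]

pairs = extract_pairs(source_side, target_side, toy_lexicon)
for src, tgt, score in pairs:
    print(f"{score:.2f}\t{src}\t{tgt}")
```

In this toy run only the first source sentence finds a match. The second has too little lexical support in either target sentence, so it is dropped rather than paired with a poor candidate.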
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-00680046
Contributor : Abes Star
Submitted on : Saturday, March 17, 2012 - 12:52:19 PM
Last modification on : Thursday, April 18, 2019 - 4:40:45 PM
Long-term archiving on : Monday, June 18, 2012 - 5:06:18 PM

File

20580_DO_2011_archivage1.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-00680046, version 1

Citation

Thi Ngoc Diep Do. Extraction de corpus parallèle pour la traduction automatique depuis et vers une langue peu dotée. Autre [cs.OH]. Université Grenoble Alpes; Université de Hanoi -- Vietnam, 2011. Français. ⟨NNT : 2011GRENM065⟩. ⟨tel-00680046⟩
