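Construction et évaluation pour la TA d'un corpus journalistique bilingue : application au français-somali (Construction and evaluation for MT of a bilingual journalistic corpus: application to French-Somali)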

Abstract : As part of ongoing work to computerize a large number of under-resourced ("poorly endowed") languages, especially those of the French-speaking world, we have created a French-Somali machine translation (MT) system dedicated to a journalistic sub-language. It is intended to provide quality translations for the Somali-speaking, non-French-speaking populations of the Horn of Africa, and is trained on a bilingual corpus built by post-editing GoogleTranslate (GT) output. For this, we created the very first quality French-Somali parallel corpus, comprising to date 98,912 words (about 400 standard pages) and 10,669 segments. It is an aligned corpus of very good quality, because we built it by post-editing pre-translations produced by GT, which uses a combination of its French-English and English-Somali language pairs. The corpus was also evaluated by 9 bilingual annotators, who assigned a quality score to each segment and corrected our post-editing. Using this growing corpus as training data, we built several successive versions of MosesLIG-fr-so, a statistical phrase-based machine translation (PBMT) system, which has proven to be better than GoogleTranslate on this language pair and this sub-language in terms of BLEU and post-editing time. We also used OpenNMT to build and experiment with a first French-Somali neural MT system, in order to improve MT results without incurring prohibitive computation times, either during training or during decoding. In addition, we have set up an iMAG (interactive multilingual access gateway) that allows non-French-speaking Somali web users on the continent to access the online edition of the newspaper "La Nation de Djibouti" in Somali.
The segments (sentences or titles), automatically pre-translated by any available fr-so MT system, can be post-edited and rated (on a scale of 1 to 20) by the readers themselves, so as to improve the system by incremental learning, as has been done before for the French-Chinese PBMT system created by [Wang, 2015].

https://tel.archives-ouvertes.fr/tel-02269987
Contributor : Abes Star
Submitted on : Friday, August 23, 2019 - 2:58:41 PM
Last modification on : Tuesday, August 27, 2019 - 9:18:09 AM

File

AHMED_ASSOWE_2019_archivage.pd...
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-02269987, version 1

Collections

STAR | LIG | UGA

Citation

Houssein Ahmed Assowe. Construction et évaluation pour la TA d'un corpus journalistique bilingue : application au français-somali. Informatique et langage [cs.CL]. Université Grenoble Alpes, 2019. Français. ⟨NNT : 2019GREAM019⟩. ⟨tel-02269987⟩
