Statistical Machine Translation in a Multimodal Context

Abstract : The performance of statistical machine translation (SMT) systems depends on the availability of bilingual parallel texts, also known as bitexts. However, freely available parallel texts are a scarce resource: their size is often limited, their linguistic coverage is insufficient, or their domain is not appropriate. For some domains, there are relatively few language pairs for which parallel corpora of sufficient size are available. One way to overcome the lack of parallel data is to exploit comparable corpora, which are more abundant. Previous work in this area has been applied to the text modality. The question we ask in this thesis is: can comparable multimodal corpora provide a solution to the lack of parallel data in machine translation? We study how to use resources from different modalities (text or speech) for the development of a statistical machine translation system. The first part of the contributions provides a method for extracting parallel data from a comparable multimodal corpus (text and audio). The audio data are transcribed with an automatic speech recognition system and translated with a machine translation system. These translations are then used as queries to select parallel sentences and generate a bitext. In the second part of the contributions, we improve our method by exploiting sub-sentential entities, extending our system to generate parallel segments. We also improve the filtering module. Finally, we present several approaches for adapting translation systems with the extracted data. Our experiments, conducted on data from the TED and Euronews websites, show the feasibility of our approaches.
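The extraction steps described in the abstract (translate the transcripts, use the translations as queries, select candidate sentences, filter) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the function names, the toy data, the dictionary standing in for the machine translation system, the token-overlap (Dice) score used for both retrieval and filtering, and the 0.5 threshold. The abstract does not say which retrieval engine or filtering measure the actual system uses.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def overlap_score(query, candidate):
    """Dice coefficient over token multisets, in [0, 1] (illustrative choice)."""
    q, c = Counter(tokenize(query)), Counter(tokenize(candidate))
    total = sum(q.values()) + sum(c.values())
    if total == 0:
        return 0.0
    shared = sum((q & c).values())  # multiset intersection
    return 2.0 * shared / total


def extract_bitext(source_transcripts, translate, target_sentences, threshold=0.5):
    """Pair each source transcript with its best-matching target sentence.

    `translate` is a hypothetical callable standing in for the MT system.
    Pairs whose score falls below `threshold` are dropped (the filtering step).
    """
    bitext = []
    for src in source_transcripts:
        query = translate(src)  # the translation serves as the retrieval query
        best = max(target_sentences,
                   key=lambda t: overlap_score(query, t),
                   default=None)
        if best is not None and overlap_score(query, best) >= threshold:
            bitext.append((src, best))
    return bitext


# Toy data: hypothetical ASR transcripts and a dictionary standing in for MT.
transcripts = ["the parliament voted today", "thank you very much"]
toy_mt = {
    "the parliament voted today": "le parlement a voté aujourd'hui",
    "thank you very much": "merci beaucoup",
}
target_side = [
    "le parlement européen a voté aujourd'hui",
    "la météo sera pluvieuse demain",
]

pairs = extract_bitext(transcripts, toy_mt.get, target_side)
print(pairs)
```

In this toy run only the first transcript finds a target sentence above the threshold; the second has no lexical overlap with any candidate and is filtered out. A real system would replace the overlap score with a proper information retrieval engine and a translation-quality measure.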

Contributor : Abes Star
Submitted on : Tuesday, January 19, 2016 - 6:04:06 PM
Last modification on : Friday, June 30, 2017 - 12:52:00 PM
Long-term archiving on : Wednesday, April 20, 2016 - 1:00:56 PM


Version validated by the jury (STAR)


  • HAL Id : tel-01259046, version 1



Haithem Afli. La Traduction automatique statistique dans un contexte multimodal. Informatique et langage [cs.CL]. Université du Maine, 2014. Français. ⟨NNT : 2014LEMA1012⟩. ⟨tel-01259046⟩


