Les Triggers Inter-langues pour la Traduction Automatique Statistique

Caroline Lavecchia 1
1 PAROLE - Analysis, perception and recognition of speech
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract : During my Ph.D., I conducted research in Machine Translation (MT), i.e. producing a target-language translation of a source sentence without any human intervention. My work focused on the statistical approach to MT, which uses several probabilistic models trained on large parallel corpora to retrieve the most likely translation of a source sentence. My thesis addresses two issues in Statistical Machine Translation (SMT): the collection of aligned parallel corpora and the estimation of translation models from these corpora. An SMT system extracts the knowledge it needs from parallel corpora in which each source sentence is aligned with its translation in a target language. Most SMT research uses the proceedings of the European Parliament, which are available in many languages, as parallel corpora. Such corpora are poorly suited to spontaneous speech translation. I therefore decided to use movie subtitles in order to build a more realistic machine translation system. Movie subtitles are considered difficult data and cannot be used as parallel corpora for SMT without preprocessing. I proposed an original algorithm based on Dynamic Time Warping to align movie subtitles automatically. The resulting parallel corpora constitute a rich resource for training SMT systems. In SMT, several statistical models are trained on parallel corpora, such as the alignment model, the translation table, and the distortion model. The translation table is the main model an SMT system needs to perform translation: it gives the translation probabilities between target and source words. Existing methods usually estimate these tables from word alignments, which are obtained through complex and therefore time-consuming algorithms.
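The subtitle alignment step mentioned above can be illustrated with a generic Dynamic Time Warping sketch. This is not the algorithm of the thesis, whose details the abstract does not give; it assumes, purely for illustration, that each subtitle is represented by its start time in seconds and that the local cost is the absolute time difference. The function name `dtw_align` and its parameters are hypothetical.

```python
def dtw_align(src_times, tgt_times):
    """Align two sequences of subtitle start times with classic DTW.

    Assumption (not from the thesis): the local cost is |t_src - t_tgt|.
    Returns the total cost and the warping path as (src_index, tgt_index) pairs.
    """
    n, m = len(src_times), len(tgt_times)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i source and j target subtitles
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(src_times[i - 1] - tgt_times[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # one-to-one match
                                 cost[i - 1][j],      # several source to one target
                                 cost[i][j - 1])      # one source to several target
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy example: three subtitles per language with slightly shifted timings.
total, path = dtw_align([1.0, 5.0, 9.0], [1.2, 5.1, 9.3])
print(path)
```

A real subtitle aligner would likely combine timing with lexical cues; the sketch only shows the dynamic-programming core.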
My main goal was to rethink the problem and to explore new ways of generating translation tables, at both word and phrase level, that differ fundamentally from state-of-the-art solutions. I proposed an original approach based on inter-lingual triggers, which does not require any word-level alignment. Inter-lingual triggers reveal highly correlated source and target word sequences by computing the Mutual Information (MI) between them. The underlying idea is that if a source sequence is strongly correlated with a target sequence in terms of MI, then the occurrence of the former is assumed to trigger the occurrence of the latter, and vice versa. I proposed to use inter-lingual triggers on parallel corpora to retrieve probable translations of word sequences and thus build a translation table. MI is a co-occurrence measure that can easily be computed in one pass over the parallel corpora. To select inter-lingual triggers, we assume that two sequences co-occur if they appear in at least one pair of aligned sentences in the parallel corpora. The method I proposed therefore requires alignment only at sentence level, not at word level. Using inter-lingual triggers makes my approach to estimating translation tables less complex than existing approaches while remaining as effective. At word level, the translation table obtained with inter-lingual triggers led to automatic translations of better quality, in terms of BLEU score, than those produced with a word translation table estimated by the well-known IBM Model 3. At phrase level, the translation table based on inter-lingual triggers leads to automatic translations with a BLEU score greater than 34, very close to the score obtained with a phrase translation table estimated by a state-of-the-art method that requires word alignment of the parallel corpora.
Keywords: Statistical Machine Translation, Inter-lingual Triggers, phrase-based Machine Translation
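The trigger selection described in the abstract can be sketched at word level as follows. The abstract does not state the exact MI formula, so the code assumes one common formulation, MI(s, t) = P(s, t) log(P(s, t) / (P(s) P(t))), with probabilities estimated as fractions of sentence pairs; the function name `interlingual_triggers` and the `top_n` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def interlingual_triggers(sentence_pairs, top_n=1):
    """Return, for each source word, its top_n target words ranked by MI.

    Assumptions (not from the thesis): whitespace tokenisation, and
    MI(s, t) = P(s, t) * log(P(s, t) / (P(s) * P(t))).
    """
    n_pairs = len(sentence_pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    # One pass over the corpus: only sentence-level alignment is needed.
    for src, tgt in sentence_pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Two words co-occur if they appear in the same aligned sentence pair.
        joint_count.update(product(src_words, tgt_words))

    candidates = {}
    for (s, t), c in joint_count.items():
        p_st = c / n_pairs
        p_s = src_count[s] / n_pairs
        p_t = tgt_count[t] / n_pairs
        mi = p_st * math.log(p_st / (p_s * p_t))
        candidates.setdefault(s, []).append((mi, t))

    return {s: [t for _, t in sorted(scored, reverse=True)[:top_n]]
            for s, scored in candidates.items()}


# Toy sentence-aligned corpus (French source, English target).
corpus = [("le chat", "the cat"), ("le chien", "the dog"),
          ("un chat", "a cat"), ("un chien", "a dog")]
triggers = interlingual_triggers(corpus)
print(triggers["chat"])
```

Turning the retained triggers into a translation table would additionally require normalising the scores into probabilities, a step the abstract does not detail and that is left out here.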

https://tel.archives-ouvertes.fr/tel-00545463
Contributor : Caroline Lavecchia <>
Submitted on : Friday, December 10, 2010 - 11:57:35 AM
Last modification on : Monday, April 16, 2018 - 10:41:47 AM
Document(s) archived on : Thursday, June 30, 2011 - 1:42:07 PM

Identifiers

  • HAL Id : tel-00545463, version 1

Citation

Caroline Lavecchia. Les Triggers Inter-langues pour la Traduction Automatique Statistique. Computer Science [cs]. Université Nancy II, 2010. French. ⟨tel-00545463⟩
