Model adaptation techniques in machine translation

Abstract : Nowadays several indicators suggest that the statistical approach to machine translation is the most promising. It allows fast development of systems for any language pair provided that sufficient training data is available.

Statistical Machine Translation (SMT) systems use parallel texts, also called bitexts, as training material for the creation of the translation model, and monolingual corpora for target language modeling.

The performance of an SMT system depends heavily upon the quality and quantity of the available data. In order to train the translation model, parallel texts are collected from various sources and domains. These corpora are usually concatenated, word alignments are calculated and phrases are extracted.

However, parallel data is quite inhomogeneous in many practical applications with respect to several factors such as data source, alignment quality, appropriateness to the task, etc. This means that the corpora are not weighted according to their importance to the domain of the translation task. Therefore, it is the domain of the training resources that influences the translations that are selected among several choices. This is in contrast to the training of the language model, for which well-known techniques are used to weight the various sources of texts.

We have proposed novel methods to automatically weight the heterogeneous data to adapt the translation model.

In a first approach, this is achieved with a resampling technique. A weight is assigned to each bitext to select the proportion of data from that corpus. The alignments coming from each bitext are resampled based on these weights. The weights of the corpora are directly optimized on the development data using a numerical method. Moreover, an alignment score of each aligned sentence pair is used as a confidence measure.

In an extended work, we obtain such a weighting by resampling alignments using weights that decrease with the temporal distance of the bitexts to the test set. By these means, we can use all the available bitexts and still put an emphasis on the most recent ones. The main idea of our approach is to use a parametric form, or meta-weights, for the weighting of the different parts of the bitexts. This ensures that our approach has only a few parameters to optimize.

In another work, we have proposed a generic framework which takes into account the corpus-level and sentence-level "goodness scores" during the calculation of the phrase table, which results in a better distribution of the probability mass of the individual phrase pairs.
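The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the dictionary layout of the corpora, and in particular the exponential decay used as the parametric form are assumptions made here for illustration only. The abstract states only that the weights decrease with temporal distance and are governed by few parameters.

```python
import math
import random


def temporal_weights(ages, decay):
    """Return normalized weights that decrease with temporal distance.

    ages  : temporal distance of each bitext part to the test set
            (the unit is hypothetical, e.g. months)
    decay : single meta-parameter; the exponential form is an assumption
    """
    raw = [math.exp(-decay * age) for age in ages]
    total = sum(raw)
    return [w / total for w in raw]


def resample(corpora, weights, sample_size, seed=0):
    """Draw sentence pairs with replacement.

    corpora : dict mapping a corpus name to its list of aligned sentence pairs
    weights : dict mapping a corpus name to its (non-negative) weight
    Each draw first picks a corpus in proportion to its weight,
    then picks one sentence pair uniformly from that corpus.
    """
    rng = random.Random(seed)
    names = list(corpora)
    corpus_weights = [weights[name] for name in names]
    sample = []
    for _ in range(sample_size):
        name = rng.choices(names, weights=corpus_weights, k=1)[0]
        sample.append(rng.choice(corpora[name]))
    return sample
```

In such a setup, only `decay` (or the handful of corpus weights) would need to be tuned on the development data, which is the point of using a parametric form rather than one free weight per part of the bitexts.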
Document type :
Theses
Other [cs.OH]. Université du Maine, 2012. English. <NNT : 2012LEMA1003>


https://tel.archives-ouvertes.fr/tel-00718226
Contributor : Abes Star
Submitted on : Monday, July 16, 2012 - 2:53:09 PM
Last modification on : Monday, July 16, 2012 - 3:12:08 PM

File

2012LEMA1003_converti.pdf

Identifiers

  • HAL Id : tel-00718226, version 1

Citation

Kashif Shah. Model adaptation techniques in machine translation. Other [cs.OH]. Université du Maine, 2012. English. <NNT : 2012LEMA1003>. <tel-00718226>
