Methods and Tools for Weak Problems of Translation

Abstract : Given a source language L1 and a target language L2, a written translation unit S in L1 of n words may have an exponential number N = O(k^n) of valid translations T1...TN. We are interested in the case where N is very small because of the proximity of the written forms of L1 and L2. Our domain of investigation is the class of pairs of language and writing system combinations (Li-Wi, Lj-Wj) such that there may be only one or a very small number of valid translations for any given S of Li written in Wi. The problem of translating a Hindi/Urdu sentence written in the Urdu script into an equivalent one in Devanagari falls in this class. We call the problem of translation for such a pair a weak translation problem. We have designed and experimented with methods of increasing complexity for solving instances of this problem, from simple finite-state transduction to the transformation of charts of partial syntax trees, with or without the inclusion of empirical (mainly probabilistic) methods. This leads to identifying the translation difficulty of a (Li-Wi, Lj-Wj) pair with the degree of complexity of the translation methods needed to achieve a desired goal (such as an error rate below 15%). Considering transliteration or transcription as a special case of translation, we have developed a method based on the definition of a universal intermediate transcription (UIT) for given groups of Li-Wi couples, and used UIT as a phonetico-graphemic pivot. For handling interdialectal translation into languages with rich inflectional morphology, we propose to perform a limited on-demand surface analysis into partial syntax trees and to use it to update and propagate features such as gender and number and to handle word boundary phenomena. Besides large-scale experiments, this work has led to the production of linguistic resources, such as parallel and tagged corpora, and of running systems, all freely available on the Web.
They include monolingual corpora, lexicons, morphological analyzers with limited vocabulary, phrase structure grammars of Hindi, Punjabi and Urdu, and online web services for transliteration between Hindi and Urdu, between Punjabi (Shahmukhi) and Punjabi (Gurmukhi), etc. An interesting perspective is to apply our techniques to distant L-W pairs, for which they could efficiently produce active learning presentations in the form of multiple pidgin outputs.
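The pivot idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the UIT encoding defined in the thesis: the pivot symbols ("k", "A", ...), the two tiny mapping tables, and the function names are all assumptions made up for this example. The sketch only shows the general shape of a character-level transduction that goes from one script to a shared intermediate transcription and then to another script.

```python
# Toy pivot-based transliteration (hypothetical tables, not the thesis's UIT).
# Step 1: source script -> pivot symbols. Step 2: pivot symbols -> target script.

# Urdu (Perso-Arabic) letters mapped to invented pivot symbols.
URDU_TO_PIVOT = {
    "\u06A9": "k",  # ک KEHEH
    "\u0628": "b",  # ب BEH
    "\u0644": "l",  # ل LAM
    "\u0645": "m",  # م MEEM
    "\u0646": "n",  # ن NOON
    "\u0631": "r",  # ر REH
    "\u0627": "A",  # ا ALEF, treated here as a word-medial long vowel
}

# The same invented pivot symbols mapped to Devanagari.
PIVOT_TO_DEVANAGARI = {
    "k": "\u0915",  # क
    "b": "\u092C",  # ब
    "l": "\u0932",  # ल
    "m": "\u092E",  # म
    "n": "\u0928",  # न
    "r": "\u0930",  # र
    "A": "\u093E",  # ा (vowel sign AA, attaches to the preceding consonant)
}


def to_pivot(text, table):
    """Map each source character to its pivot symbol (KeyError if unknown)."""
    return [table[ch] for ch in text]


def from_pivot(symbols, table):
    """Render a sequence of pivot symbols in the target script."""
    return "".join(table[s] for s in symbols)


def transliterate(text):
    """Urdu script -> pivot -> Devanagari, one character at a time."""
    return from_pivot(to_pivot(text, URDU_TO_PIVOT), PIVOT_TO_DEVANAGARI)


if __name__ == "__main__":
    word = "\u06A9\u0627\u0645"  # کام
    print(to_pivot(word, URDU_TO_PIVOT))
    print(transliterate(word))
```

With a shared pivot, each script needs only one table to and from the intermediate transcription rather than one table per script pair. This toy version ignores everything that makes the real problem harder, such as short vowels that the Urdu script usually leaves unwritten, word-initial alef, and context-dependent letters, which is where the more complex methods listed in the abstract come in.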
Document type : Theses
Computer Science. Université Joseph-Fourier - Grenoble I, 2010. English


https://tel.archives-ouvertes.fr/tel-00502192
Contributor : Muhammad Ghulam Abbas Malik
Submitted on : Tuesday, July 13, 2010 - 1:54:47 PM
Last modification on : Tuesday, July 13, 2010 - 2:39:18 PM

Identifiers

  • HAL Id : tel-00502192, version 1

Citation

Muhammad Ghulam Abbas Malik. Methods and Tools for Weak Problems of Translation. Computer Science. Université Joseph-Fourier - Grenoble I, 2010. English. <tel-00502192>
