Methods and Tools for Weak Problems of Translation

Muhammad Ghulam Abbas Malik

Theses Year : 2010

Méthodes et outils pour les problèmes faibles de traduction

Muhammad Ghulam Abbas Malik (1, 2)

Function : Author
PersonId : 172438
IdHAL : m-g-abbas-malik
ORCID : 0000-0002-0679-8346

1 Laboratoire d'Informatique de Grenoble
2 Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

Abstract

Given a source language L1 and a target language L2, a written translation unit S in L1 of n words may have an exponential number N = O(k^n) of valid translations T1...TN. We are interested in the case where N is very small because of the proximity of the written forms of L1 and L2. Our domain of investigation is the class of pairs of language and writing system combinations (Li-Wi, Lj-Wj) such that there may be only one or a very small number of valid translations for any given S of Li written in Wi. The problem of translating a Hindi/Urdu sentence written in Urdu into an equivalent one in Devanagari falls in this class. We call the problem of translation for such a pair a weak translation problem. We have designed and experimented with methods of increasing complexity for solving instances of this problem, from simple finite-state transduction to the transformation of charts of partial syntax trees, with or without the inclusion of empirical (mainly probabilistic) methods. That leads to the identification of the translation difficulty of a (Li-Wi, Lj-Wj) pair as the degree of complexity of the translation methods achieving a desired goal (such as less than 15% error rate). Considering transliteration or transcription as a special case of translation, we have developed a method based on the definition of a universal intermediate transcription (UIT) for given groups of Li-Wi pairs and used UIT as a phonetico-graphemic pivot. For handling interdialectal translation into languages with rich inflectional morphology, we propose to perform a limited on-demand surface analysis into partial syntax trees and to use it to update and propagate features such as gender and number and to handle word boundary phenomena.

Besides large-scale experiments, this work has led to the production of linguistic resources such as parallel and tagged corpora and of running systems, all freely available on the Web. They include monolingual corpora, lexicons, morphological analyzers with limited vocabulary, phrase structure grammars of Hindi, Punjabi and Urdu, online web-services for transliteration between Hindi & Urdu, Punjabi (Shahmukhi) & Punjabi (Gurmukhi), etc. An interesting perspective is to apply our techniques to distant L-W pairs, for which they could efficiently produce active learning presentations in the form of multiple pidgin outputs.
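
The pivot idea behind an intermediate transcription can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pivot symbols are arbitrary ASCII labels and not the UIT encoding defined in the thesis, the tables cover only six consonants, and the function names are hypothetical. Vowels, diacritics and context-dependent rules, which a real system must handle, are ignored.

```python
from typing import Dict, List

# Hypothetical character tables: script character -> pivot symbol.
# The pivot labels are invented for this sketch; they are NOT the UIT
# encoding from the thesis. Only a few consonants are covered.
TO_PIVOT: Dict[str, Dict[str, str]] = {
    "urdu": {"ک": "k", "ب": "b", "ل": "l", "م": "m", "ن": "n", "ر": "r"},
    "devanagari": {"क": "k", "ब": "b", "ल": "l", "म": "m", "न": "n", "र": "r"},
}

# Invert each table once so that every script can also be generated
# from the pivot. Adding a script needs one new table, not one per pair.
FROM_PIVOT: Dict[str, Dict[str, str]] = {
    script: {pivot: char for char, pivot in table.items()}
    for script, table in TO_PIVOT.items()
}


def to_pivot(text: str, script: str) -> List[str]:
    """Map each character to its pivot symbol; unknown characters pass through."""
    table = TO_PIVOT[script]
    return [table.get(char, char) for char in text]


def from_pivot(symbols: List[str], script: str) -> str:
    """Map pivot symbols to characters of the target script."""
    # Caveat of this toy design: a passed-through character that happens
    # to equal a pivot label (e.g. a Latin "k") would also be converted.
    table = FROM_PIVOT[script]
    return "".join(table.get(symbol, symbol) for symbol in symbols)


def transliterate(text: str, source: str, target: str) -> str:
    """Transliterate by going source script -> pivot -> target script."""
    return from_pivot(to_pivot(text, source), target)


if __name__ == "__main__":
    print(transliterate("نمک", "urdu", "devanagari"))
```

With these toy tables, `transliterate("نمک", "urdu", "devanagari")` yields `नमक`, and the reverse direction restores the Urdu form. The example works only because this word needs no vowel marks in either script; most words would require the richer, context-sensitive rules that the abstract attributes to finite-state transduction.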

Étant données une langue source L1 et une langue cible L2, un segment (phrase ou titre) S de n mots écrit en L1 peut avoir un nombre exponentiel N = O(k^n) de traductions valides T1...TN. Nous nous intéressons au cas où N est très faible en raison de la proximité des formes écrites de L1 et L2. Notre domaine d'investigation est la classe des paires de combinaisons de langue et de système d'écriture (Li-Wi, Lj-Wj) telles qu'il peut y avoir une seule traduction valide, ou un très petit nombre de traductions valides, pour tout segment S de Li écrit en Wi. Le problème de la traduction d'une phrase hindi/ourdou écrite en ourdou vers une phrase équivalente en devanagari tombe dans cette classe. Nous appelons le problème de la traduction pour une telle paire un problème faible de traduction. Nous avons conçu et expérimenté des méthodes de complexité croissante pour résoudre des instances de ce problème, depuis la transduction à états finis simple jusqu'à la transformation de graphes de chaînes d'arbres syntaxiques partiels, avec ou sans l'inclusion de méthodes empiriques (essentiellement probabilistes). Cela conduit à l'identification de la difficulté de traduction d'une paire (Li-Wi, Lj-Wj) comme le degré de complexité des méthodes de traduction atteignant un objectif souhaité (par exemple, moins de 15 % de taux d'erreur). Considérant la translittération ou la transcription comme un cas spécial de traduction, nous avons développé une méthode basée sur la définition d'une transcription intermédiaire universelle (UIT) pour des groupes donnés de couples Li-Wi, et avons utilisé UIT comme un pivot phonético-graphémique. Pour traiter la traduction interdialectale dans des langues à morphologie flexionnelle riche, nous proposons de faire une analyse de surface sur demande et limitée, produisant des arbres syntaxiques partiels, et de l'employer pour mettre à jour et propager des traits tels que le genre et le nombre, et pour traiter les phénomènes aux limites des mots.

À côté d'expériences à grande échelle, ce travail a conduit à la production de ressources linguistiques telles que des corpus parallèles et annotés, et à des systèmes opérationnels, tous disponibles gratuitement sur le Web. Ils comprennent des corpus monolingues, des lexiques, des analyseurs morphologiques avec un vocabulaire limité, des grammaires syntagmatiques du hindi, du punjabi et de l'ourdou, des services Web en ligne pour la translittération entre hindi et ourdou, punjabi (shahmukhi) et punjabi (gurmukhi), etc. Une perspective intéressante est d'appliquer nos techniques à des paires distantes L-W, pour lesquelles elles pourraient produire efficacement des présentations d'apprentissage actif, sous la forme de sorties pidgin multiples.

Keywords

Seraiki; Punjabi; Sindhi; Kashmiri; Hindi; South Asian Languages; Writing Systems; Urdu; Probabilistic Methods; Partial Phrase Structure Analysis; Tree Transformation; Empirical Methods; Partial Syntax Tree; Word-to-word Transformation; Machine Translation; Weak Translation Problem; Multiscriptural processing; Multilingual processing; Finite-state Automata; Finite-state Transducers; Rule-based Methodology; Interlingua Approach; Intermediate Transcription; Graph-based Approach; Interactive Translation; Morphology; Morphological Transformation; Machine Transliteration

cachemirien; méthodes probabilistes; langues de l'Asie du sud; ourdu; systèmes d'écriture; méthodes empiriques; analyse partielle en constituants; arbre syntaxique partiel; transformation d'arbres; transducteurs d'états finis; traitement multiscriptural; traitement multilingue; automates d'états finis; problème faible de traduction; Traduction Automatique; translittération automatique; approche interlingue; méthodologie basée sur des règles; transcription intermédiaire; approche basée sur les graphes; traduction interactive; morphologie; transformation morphologique; transformation mot-à-mot

Domains

Computer Science [cs]

Main file

Thesis_Abbas_Malik_-_GETALP_-_LIG.pdf (4.18 MB)

https://theses.hal.science/tel-00502192

Submitted on : Tuesday, July 13, 2010, 1:54:47 PM

Last modification on : Thursday, April 4, 2024, 9:10:23 PM

Long-term archiving on : Thursday, October 14, 2010, 3:41:01 PM

Dates and versions

tel-00502192, version 1 (13-07-2010)

Identifiers

HAL Id : tel-00502192, version 1

Cite

Muhammad Ghulam Abbas Malik. Methods and Tools for Weak Problems of Translation. Computer Science [cs]. Université Joseph-Fourier - Grenoble I, 2010. English. ⟨NNT : ⟩. ⟨tel-00502192⟩

Collections

UGA CNRS UJF LIG LIG_TDCGE LIG_TDCGE_GETALP LIG_SIDCH
