Méthodes en caractères pour le traitement automatique des langues (Character-based methods for natural language processing)

Abstract: Data-driven natural language processing has adopted a number of techniques and viewpoints from the field of speech recognition. In particular, its reliance on the word as the basic unit makes methods difficult to transpose to languages with no orthographic separators, so such methods cannot readily be applied in a multilingual context.
The present work aims at universal, multilingual methods, and therefore promotes character-based methods for natural language processing. Word-based processing of non-segmenting languages such as Chinese or Japanese requires a prior segmentation step; using the character, a unit that is immediately accessible in the electronic form of every language, makes that step unnecessary.

We first transposed to character units a well-known automatic evaluation measure for machine translation, BLEU.
The satisfactory results obtained with BLEU led us to consider other tasks in linguistic data processing: grammatical filtering, and profiling linguistic resources for similarity and homogeneity. Character-based processing gave satisfactory results, comparable to those obtained with words.
Finally, we considered data generation tasks: proportional analogy on character strings allows the automatic generation of paraphrases, as well as machine translation (MT).
This work shows that a complete MT system may be built which does not require any segmentation of linguistic data, and which may therefore handle non-segmenting languages with no preprocessing.
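
To make the idea of transposing BLEU to character units concrete, here is a minimal sketch, not the thesis's actual implementation. It assumes a single reference, n-grams up to length 4, uniform weights, and no smoothing; the names `char_ngrams` and `char_bleu` are hypothetical. The only change from word-level BLEU is that n-grams are taken over characters, so no segmentation step is needed.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count all contiguous character n-grams in text (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_bleu(candidate, reference, max_n=4):
    """BLEU-style score over characters (assumed: one reference, no smoothing)."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = char_ngrams(candidate, n)
        ref = char_ngrams(reference, n)
        total = sum(cand.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty, with lengths measured in characters rather than words.
    c, r = len(candidate), len(reference)
    brevity = 1.0 if c >= r else math.exp(1 - r / c)
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)


# Unsegmented Japanese strings can be scored directly, with no tokenizer.
score = char_bleu("機械翻訳の評価", "機械翻訳の自動評価")
```

Because the score is computed on raw strings, the same function applies unchanged to segmenting and non-segmenting languages; the appropriate maximum n-gram length for characters is a tuning choice that this sketch does not settle.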
Document type: Theses
Cited literature: 126 references

https://tel.archives-ouvertes.fr/tel-00107056
Contributor: Etienne Denoual
Submitted on: Tuesday, October 17, 2006 - 11:19:39 AM
Last modification on: Friday, November 6, 2020 - 4:14:29 AM
Long-term archiving on: Thursday, September 20, 2012 - 12:00:43 PM

Identifiers

  • HAL Id : tel-00107056, version 1

Collections

UJF | IMAG | CNRS | UGA

Citation

Etienne Denoual. Méthodes en caractères pour le traitement automatique des langues. Autre [cs.OH]. Université Joseph-Fourier - Grenoble I, 2006. Français. ⟨tel-00107056⟩
