Méthodes de segmentation et d'analyse automatique de textes thaï

Abstract : This thesis designs and implements a computational linguistic module for analysing Thai texts within the INTEX © system. Because INTEX © was built essentially around Indo-European languages written in the Latin alphabet, it encounters difficulties when processing a very different language such as Thai. The crucial problem is word and sentence segmentation: Thai has no word separator, so a sentence is written as a continuous sequence of letters, and sentence separators are frequently ambiguous. Accordingly, we have developed and evaluated two methods of word segmentation, the first using Regular Expressions and the second using Finite-State Transducers, which segment Thai texts into letters and syllables respectively. We have also created Thai Electronic Dictionaries, which are used to recognise words from letters or from syllables and, at the same time, to label them with syntactic and semantic tags. Two methods of Thai sentence segmentation, based on punctuation marks and on keywords, are also proposed and evaluated. Finally, we demonstrate that, as a result of this work, INTEX © is able to analyse Thai documents in spite of the difficulties involved.
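To make the segmentation problem concrete, here is a minimal sketch of dictionary-based word recognition over an unsegmented string. It is not the thesis's method: the work described above relies on regular expressions, finite-state transducers and INTEX electronic dictionaries, whereas this sketch substitutes a plain greedy longest-match lookup. The function name `segment_longest_match` and the four-word `LEXICON` are hypothetical, chosen only for illustration.

```python
def segment_longest_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position, take the
    longest lexicon entry that matches; otherwise emit one character."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: emit it on its own
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy lexicon (assumption, not taken from the thesis).
LEXICON = {"ตา", "กลม", "ตาก", "ลม"}

# The string below can be read as "ตา|กลม" or as "ตาก|ลม".
print(segment_longest_match("ตากลม", LEXICON))
```

With this lexicon the greedy strategy commits to "ตาก" followed by "ลม" and never considers the other reading. This is one reason a purely greedy lookup is insufficient and why ambiguity has to be handled explicitly when words are recognised from letters or syllables.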
Document type :
Theses
https://tel.archives-ouvertes.fr/tel-00626256
Contributor : Lingu Ligm
Submitted on : Thursday, September 29, 2011 - 9:30:55 AM
Last modification on : Wednesday, April 11, 2018 - 12:12:02 PM
Long-term archiving on : Tuesday, November 13, 2012 - 2:46:40 PM

Identifiers

  • HAL Id : tel-00626256, version 1

Citation

Krit Kosawat. Méthodes de segmentation et d'analyse automatique de textes thaï. Autre [cs.OH]. Université Paris-Est, 2003. Français. ⟨tel-00626256⟩
