Méthodes de segmentation et d'analyse automatique de textes thaï (Methods for the segmentation and automatic analysis of Thai texts)

Abstract : The aim of this thesis is to design and implement a computational linguistic module for analysing Thai texts within the INTEX © system. Because INTEX © was designed primarily for Indo-European languages written in the Latin alphabet, it encounters difficulties when processing a language as different as Thai. The central problem is word and sentence segmentation: Thai has no word separator, so a sentence is written as a continuous sequence of letters, and sentence separators are frequently ambiguous. We have therefore developed and evaluated two methods of word segmentation, the first using Regular Expressions and the second using Finite-State Transducers, which segment Thai texts into letters and syllables respectively. We have also created Thai Electronic Dictionaries, which are used to recognise words from letters or from syllables and, at the same time, to label them with syntactic and semantic tags. Two methods of Thai sentence segmentation, based on punctuation marks and keywords, are also proposed and evaluated. Finally, we demonstrate that, as a result of this work, INTEX © can analyse Thai documents despite these difficulties.
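To illustrate why the absence of a word separator makes dictionary-based word recognition non-trivial, the following is a minimal Python sketch. It is not the thesis's INTEX implementation (which relies on Regular Expressions, Finite-State Transducers and Thai Electronic Dictionaries); the function name `all_segmentations` and the four-word toy lexicon are assumptions chosen purely for illustration. The sketch enumerates every way an unspaced string can be split into dictionary words.

```python
# Hypothetical illustration only: a toy lexicon, not the thesis's dictionaries.
TOY_LEXICON = {"ตา", "กลม", "ตาก", "ลม"}  # eye, round, to expose, wind


def all_segmentations(text, lexicon):
    """Return every split of `text` into words that all belong to `lexicon`."""
    if not text:
        # One way to segment the empty string: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Segment the remainder recursively and prepend the matched word.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


if __name__ == "__main__":
    # The same five letters admit two dictionary-valid readings.
    for words in all_segmentations("ตากลม", TOY_LEXICON):
        print(" | ".join(words))
```

With this toy lexicon, the string "ตากลม" can be read either as "ตา" + "กลม" or as "ตาก" + "ลม", so a dictionary lookup alone does not determine the word boundaries. The exhaustive recursion is exponential in the worst case and is meant only to expose the ambiguity, not to serve as a practical segmenter.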

Contributor : Lingu Ligm
Submitted on : Thursday, September 29, 2011 - 9:30:55 AM
Last modification on : Wednesday, April 11, 2018 - 12:12:02 PM
Long-term archiving on : Tuesday, November 13, 2012 - 2:46:40 PM


  • HAL Id : tel-00626256, version 1


Krit Kosawat. Méthodes de segmentation et d'analyse automatique de textes thaï. Autre [cs.OH]. Université Paris-Est, 2003. Français. ⟨tel-00626256⟩
