Délimitation et étiquetage des morphèmes en coréen par ressources linguistiques (Delimiting and tagging morphemes in Korean using linguistic resources)

Abstract : We present a method for delimiting morphemes in Korean texts using finite-state automata. Korean is an agglutinative language, and our system can probably be adapted to other languages with agglutinative suffixes (Hungarian, Finnish, Turkish). Korean texts are written mainly in Hangul, the Korean alphabet, which is used as a set of syllabic characters. These can be mixed with ideographs and characters of the Latin alphabet. We use the Unicode character encoding, in which Korean syllables are arranged in alphabetical order. For some processing steps on Korean syllables, we decompose each syllable into its constituent letters of the Korean alphabet. Korean words take affixes. A noun can carry several suffixes; excluding derivational suffixes, the number of suffix combinations is at most about 1,600. The first step in analyzing a Korean text is to split it into segments using separators (whitespace symbols). Each segment is then divided into morphemes. To analyze the segments, we build dictionaries of roots and of suffix sequences. We use transducers, built with the Unitex graphical interface, to represent the compatibility between morphemes (roots and suffixes). They are designed to be built and maintained manually. Our method relies on linguistic resources, whereas most morphological analysis systems rely on statistical data. We automatically merge the root dictionaries and the suffix transducers into a single transducer, which serves as a dictionary. The result of analyzing a text is represented as an automaton, which accounts for the ambiguity of the division into morphemes. Its transitions are labeled with morphemes annotated with linguistic information (canonical form, inflected form and linguistic information).
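The syllable decomposition mentioned in the abstract can be illustrated with the arithmetic that the Unicode Standard defines for precomposed Hangul syllables. The sketch below is only an illustration under that assumption: the thesis may decompose syllables differently (for example into compatibility letters rather than conjoining jamo), and the function names `decompose_syllable` and `decompose_text` are hypothetical.

```python
# Constants from the Unicode Standard's Hangul syllable arithmetic.
S_BASE = 0xAC00   # first precomposed syllable (가)
L_BASE = 0x1100   # first leading-consonant jamo
V_BASE = 0x1161   # first vowel jamo
T_BASE = 0x11A7   # one code point before the first trailing-consonant jamo
L_COUNT = 19      # leading consonants
V_COUNT = 21      # vowels
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11172 precomposed syllables


def decompose_syllable(ch: str) -> str:
    """Return the conjoining jamo of one precomposed Hangul syllable.

    Any other character (Latin letter, ideograph, space) is returned unchanged.
    """
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch
    # Syllables are laid out in alphabetical order, so the position in the
    # block encodes (leading consonant, vowel, trailing consonant).
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + l_index) + chr(V_BASE + v_index)
    if t_index:  # index 0 means the syllable has no trailing consonant
        jamo += chr(T_BASE + t_index)
    return jamo


def decompose_text(text: str) -> str:
    """Decompose every Hangul syllable in a string, leaving the rest as is."""
    return "".join(decompose_syllable(ch) for ch in text)
```

For instance, `decompose_syllable("한")` yields the three jamo U+1112, U+1161 and U+11AB. Working at the letter level in this way is one plausible reason to decompose: a suffix that attaches as a trailing consonant inside a syllable becomes a separate symbol that a dictionary or transducer can match.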
Document type :
Theses

https://tel.archives-ouvertes.fr/tel-00626255
Contributor : Lingu Ligm
Submitted on : Saturday, September 24, 2011 - 12:45:15 PM
Last modification on : Wednesday, April 11, 2018 - 12:12:02 PM
Long-term archiving on : Sunday, December 25, 2011 - 2:20:42 AM

Identifiers

  • HAL Id : tel-00626255, version 1

Citation

Hyun Gue Huh. Délimitation et étiquetage des morphèmes en coréen par ressources linguistiques. Autre [cs.OH]. Université Paris-Est, 2005. Français. ⟨tel-00626255⟩
