Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées

Abstract: This PhD thesis addresses the problems encountered when developing automatic speech recognition for under-resourced languages whose writing systems do not explicitly separate words. For these languages, text corpora must be automatically segmented into words before n-gram language modeling can be applied. The lack of text data already degrades language model performance, and the errors introduced by automatic segmentation can make the available data even less usable. To deal with these problems, our research focuses primarily on language modeling, and in particular on the choice of lexical and sub-lexical units used by the recognition systems. We investigate the use of multiple units within a speech recognition system, and we validate these multi-unit modeling approaches in recognition systems for a group of languages: Khmer, Vietnamese, Thai and Laotian.
Document type: Thesis

https://tel.archives-ouvertes.fr/tel-00646236
Contributor: Sopheap Seng
Submitted on: Tuesday, November 29, 2011 - 2:50:42 PM
Last modification on: Friday, October 25, 2019 - 1:31:45 AM
Long-term archiving on: Monday, December 5, 2016 - 8:27:22 AM

Identifiers

  • HAL Id: tel-00646236, version 1

Citation

Sopheap Seng. Vers une modélisation statistique multi-niveau du langage, application aux langues peu dotées. Informatique et langage [cs.CL]. Université de Grenoble, 2010. Français. ⟨tel-00646236⟩
