Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia

Abstract: Languages in Malaysia are dying at an alarming rate. As of today, 15 languages are in danger and two are already extinct. One way to save languages is to document them, but this is a tedious task when performed manually. An Automatic Speech Recognition (ASR) system could be a tool to help speed up the process of documenting speech from native speakers. However, building an ASR system for a target language requires a large amount of training data, as current state-of-the-art techniques are based on empirical approaches. Hence, there are many challenges in building ASR for languages with limited available data.

The main aim of this thesis is to investigate the effects of using data from closely-related languages to build ASR for low-resource languages in Malaysia. Past studies have shown that cross-lingual and multilingual methods can improve the performance of low-resource ASR. In this thesis, we try to answer several questions concerning these approaches: How do we know which language is beneficial for our low-resource language? How does the relationship between source and target languages influence speech recognition performance? Is pooling language data an optimal approach for a multilingual strategy?

Our case study is Iban, an under-resourced language spoken on the island of Borneo. We study the effects of using data from Malay, a locally dominant language that is close to Iban, for developing Iban ASR under different resource constraints. We propose several approaches to adapt Malay data to obtain pronunciation and acoustic models for Iban speech.

Building a pronunciation dictionary from scratch is time consuming, as one needs to properly define the sound units of each word in a vocabulary. We developed a semi-supervised approach to quickly build a pronunciation dictionary for Iban, based on bootstrapping techniques that iteratively improve Malay-derived pronunciations to match Iban ones.

To increase the performance of low-resource acoustic models, we explored two acoustic modelling techniques: Subspace Gaussian Mixture Models (SGMM) and Deep Neural Networks (DNN). We applied cross-lingual strategies in both frameworks to adapt out-of-language data to Iban speech. Results show that using Malay data is beneficial for increasing the performance of Iban ASR. We also tested SGMM and DNN for improving low-resource non-native ASR: we propose a fine merging strategy for obtaining an optimal multi-accent SGMM, and we developed an accent-specific DNN using native speech data. Both methods yielded significant improvements in ASR accuracy. From our study, we observe that using SGMM and DNN in a cross-lingual strategy is effective when training data is very limited.
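The abstract only sketches the semi-supervised bootstrapping of the Iban lexicon. As a rough illustration of the idea — seed grapheme-to-phoneme rules from the closely-related language, generate pronunciations for the target vocabulary, then fold in native-speaker corrections each round — the following minimal sketch uses a greedy longest-match converter. The rule table, phone symbols, and function names are hypothetical, not the thesis's actual Malay G2P rules.

```python
# Hedged sketch of semi-supervised lexicon bootstrapping.
# MALAY_G2P is an illustrative seed rule table, not the real one.
MALAY_G2P = {"a": "a", "i": "i", "u": "u", "e": "@", "o": "o",
             "b": "b", "k": "k", "g": "g", "s": "s", "t": "t",
             "n": "n", "m": "m", "r": "r", "l": "l",
             "ng": "N", "ny": "J"}

def g2p(word, rules):
    """Greedy longest-match grapheme-to-phoneme conversion."""
    phones, i = [], 0
    while i < len(word):
        # try two-letter graphemes (e.g. 'ng') before single letters
        for span in (2, 1):
            grapheme = word[i:i + span]
            if grapheme in rules:
                phones.append(rules[grapheme])
                i += span
                break
        else:
            phones.append(word[i])  # pass unknown graphemes through
            i += 1
    return " ".join(phones)

def bootstrap_round(vocab, rules, corrections):
    """One bootstrapping round: auto-generate pronunciations with the
    source-language rules, then overwrite entries a native speaker has
    corrected; the corrections can seed new rules for the next round."""
    lexicon = {word: g2p(word, rules) for word in vocab}
    lexicon.update(corrections)
    return lexicon
```

In practice each round would also update the rule table from the corrections, so the amount of manual fixing shrinks as the lexicon grows.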
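A common way to realise the cross-lingual DNN strategy the abstract mentions is hidden-layer transfer: train a network on the well-resourced source language (here, Malay), keep its hidden layers as initialisation, and re-initialise only the output layer for the target-language phone set before fine-tuning on the limited Iban data. The pure-Python sketch below shows only the parameter-transfer step; the layer sizes and function names are illustrative assumptions, not the thesis's actual configuration.

```python
import random

def init_layer(n_in, n_out, rng):
    """One weight matrix, stored as n_out rows of n_in weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(n_out)]

def build_dnn(layer_sizes, rng):
    """A DNN as a list of weight matrices (biases omitted for brevity)."""
    return [init_layer(a, b, rng)
            for a, b in zip(layer_sizes, layer_sizes[1:])]

def crosslingual_transfer(source_dnn, n_target_phones, rng):
    """Keep the source-language hidden layers as a starting point and
    re-initialise only the output layer for the target phone set.
    Fine-tuning on target-language data would follow this step."""
    hidden = [[row[:] for row in layer] for layer in source_dnn[:-1]]
    n_last_hidden = len(source_dnn[-2])   # units feeding the output layer
    output = init_layer(n_last_hidden, n_target_phones, rng)
    return hidden + [output]
```

The design choice this sketch reflects is that the hidden layers learn largely language-independent acoustic representations, so only the phone-classification layer needs target-language supervision — which is why the approach helps when target training data is very limited.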

Cited literature: 140 references

https://tel.archives-ouvertes.fr/tel-01314120
Contributor: Abes Star
Submitted on : Tuesday, May 10, 2016 - 6:51:31 PM
Last modification on : Thursday, October 11, 2018 - 8:48:02 AM
Long-term archiving on : Wednesday, May 25, 2016 - 8:50:31 AM

File

SAMSONJUAN_2015_archivage.pdf
Version validated by the jury (STAR)

Identifiers

  • HAL Id : tel-01314120, version 1

Citation

Sarah Flora Samson Juan. Exploiting resources from closely-related languages for automatic speech recognition in low-resource languages from Malaysia. Computation and Language [cs.CL]. Université Grenoble Alpes, 2015. English. ⟨NNT : 2015GREAM061⟩. ⟨tel-01314120⟩
