Unsupervised word discovery for computational language documentation

Abstract : Language diversity is under considerable pressure: half of the world’s languages could disappear by the end of this century. This realization has sparked many initiatives in documentary linguistics in the past two decades, and 2019 has been proclaimed the International Year of Indigenous Languages by the United Nations, to raise public awareness of the issue and foster initiatives for language documentation and preservation. Yet documentation and preservation are time-consuming processes, and the supply of field linguists is limited. Consequently, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools. The Breaking the Unwritten Language Barrier (BULB) project, for instance, is one of the efforts defining this new field, bringing together linguists and computer scientists. This thesis examines the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Using two realistic Bantu corpora for language documentation, one in Mboshi (Republic of the Congo) and the other in Myene (Gabon), we benchmark various monolingual and bilingual unsupervised word discovery methods. We then show that using expert knowledge in the Adaptor Grammar framework can vastly improve segmentation results, and we indicate ways to use this framework as a decision tool for the linguist. We also propose a tonal variant of a strong nonparametric Bayesian segmentation algorithm, making use of a modified backoff scheme designed to capture tonal structure. Finally, to leverage the weak supervision given by a translation, we propose and extend an attention-based neural segmentation method, significantly improving the segmentation performance of an existing bilingual method.

Cited literature [286 references]
Contributor : Abes Star
Submitted on : Friday, September 13, 2019 - 3:44:06 PM
Last modification on : Wednesday, December 9, 2020 - 3:06:33 PM
Long-term archiving on : Saturday, February 8, 2020 - 1:58:59 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02286425, version 1



Pierre Godard. Unsupervised word discovery for computational language documentation. Artificial Intelligence [cs.AI]. Université Paris-Saclay, 2019. English. ⟨NNT : 2019SACLS062⟩. ⟨tel-02286425⟩


