
Découverte d'unités linguistiques à l'aide de méthodes d'apprentissage non supervisé

Abstract : Discovering elementary linguistic units (phonemes, words) from sound recordings alone is an unsolved problem that attracts strong interest from the automatic speech processing community, as shown by the many recent contributions to the state of the art. In this thesis, we address the problem with neural networks in supervised, weakly supervised and multilingual settings. We developed automatic phoneme segmentation and phonetic classification tools based on convolutional neural networks. The segmentation tool reaches a 79% F-measure on the BUCKEYE conversational speech corpus, a result comparable to that of a human annotator according to the inter-annotator agreement reported by the creators of the corpus. It needs little data to be effective (about ten minutes per speaker, for 5 different speakers) and is portable to other languages, in particular under-resourced languages such as Xitsonga. The phonetic classification system makes it possible to set the various parameters and hyperparameters that are useful for an unsupervised scenario. In the unsupervised context, neural networks (auto-encoders) allowed us to generate new parametric representations that concentrate the information of the input frame and its neighboring frames. We studied their usefulness for audio compression from the raw signal, for which they were effective (low RMS error, even at 99% compression). We also carried out an innovative preliminary study on a different use of neural networks: generating parameter vectors not from the outputs of the layers but from the values of their weights. These parameters are designed to mimic linear predictive coding (LPC) coefficients.
In the context of the unsupervised discovery of phoneme-like units (called pseudo-phones in this thesis) and the generation of new phonetically discriminative parametric representations, we coupled a neural network with a clustering tool (k-means). Iteratively alternating between these two tools produced phonetically discriminative parameters for a given speaker: low intra-speaker ABX error rates of 7.3% for English, 8.5% for French and 8.4% for Mandarin were obtained. These results represent an absolute gain of about 4% over the baseline (conventional MFCC parameters) and are close to the best current approaches (1% higher than the winner of the Zero Resource Speech Challenge 2017). The inter-speaker error rates vary between 12% and 15% depending on the language, compared with 21% to 25% for MFCCs.
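As a reading aid for the ABX error rates quoted in the abstract, here is a minimal sketch of how such a score can be computed. It is not the evaluation code used in the thesis or in the Zero Resource Speech Challenge: the function name `abx_error`, the use of fixed-length vectors and the Euclidean distance are all assumptions made for illustration (the real evaluations compare variable-length speech segments). For every triplet (A, B, X) where X belongs to the same category as A and B to a different one, an error is counted when X lies closer to B than to A.

```python
from math import dist  # Euclidean distance, Python 3.8+


def abx_error(items):
    """Return the fraction of ABX triplets that are misclassified.

    `items` is a list of (label, vector) pairs. This is a simplified,
    hypothetical setup: one fixed-length vector per item and a
    Euclidean distance, chosen only to keep the sketch short.
    """
    errors, total = 0.0, 0
    for i, (label_a, vec_a) in enumerate(items):
        for label_b, vec_b in items:
            if label_b == label_a:
                continue  # B must come from a different category than A
            for k, (label_x, vec_x) in enumerate(items):
                if k == i or label_x != label_a:
                    continue  # X is another item of A's category
                d_ax = dist(vec_a, vec_x)
                d_bx = dist(vec_b, vec_x)
                if d_ax > d_bx:
                    errors += 1.0  # X looks closer to the wrong category
                elif d_ax == d_bx:
                    errors += 0.5  # ties count as half an error
                total += 1
    return errors / total if total else 0.0


# Toy example: two well-separated categories.
toy = [
    ("a", (0.0, 0.0)),
    ("a", (0.0, 1.0)),
    ("b", (5.0, 5.0)),
    ("b", (5.0, 6.0)),
]
print(abx_error(toy))  # 0.0: every X is closer to its own category
```

A lower score means the representation separates the categories better, which is the sense in which the learned parameters are compared with MFCCs in the abstract.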

Cited literature : 244 references
Contributor : Abes Star
Submitted on : Wednesday, July 8, 2020 - 3:20:15 PM
Last modification on : Wednesday, November 3, 2021 - 6:52:02 AM
Long-term archiving on : Monday, November 30, 2020 - 3:21:11 PM


Version validated by the jury (STAR)


  • HAL Id : tel-02893779, version 1


Céline Manenti. Découverte d'unités linguistiques à l'aide de méthodes d'apprentissage non supervisé. Artificial Intelligence [cs.AI]. Université Paul Sabatier - Toulouse III, 2019. French. ⟨NNT : 2019TOU30074⟩. ⟨tel-02893779⟩


