
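Reconnaissance automatique de la parole d'enfants apprenant·e·s lecteur·ice·s en salle de classe : modélisation acoustique de phonèmes [Automatic speech recognition for children learning to read in the classroom: acoustic modeling of phonemes]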

Lucile Gelin
Abstract : In this PhD thesis, we aim to improve the phonetic transcription of oral readings by children learning to read, recorded in a classroom environment. These automatic transcriptions power a reading-mistake detection system used in the reading-aloud exercise of the Lalilo pedagogical platform. Good accuracy is essential to give the child appropriate feedback and thus support their learning. The first section presents the main challenges of our task. Automatic recognition of children's speech is more difficult than that of adults' speech because of its very high acoustic and prosodic variability. The scarcity of available data, especially in French, requires inventive approaches to model this variability correctly. Finally, frequent fluency and decoding mistakes, as well as classroom babble noise, constitute additional difficulties. In the second section, we build a hybrid TDNNF-HMM acoustic model, which becomes our baseline. Transfer learning allows us to overcome the lack of data and achieve a phoneme error rate (PER) of 30.1%. We study different acoustic parameters and normalization methods with the aim of maximizing the model's performance. Data augmentation by adding noise, intended to improve the model's robustness to classroom babble noise, further improves the PER by 6.4% relative. In the final section, we explore recent end-to-end architectures based on RNNs, CTC modules and attention mechanisms. Our work is one of the first to apply end-to-end architectures to child speech and to analyze their strengths and weaknesses with respect to the specificities of oral reading by children learning to read. Our Transformer+CTC system provides the best results (25.0% PER) thanks to the relevance of the acoustic and textual information extracted by its self-attention mechanisms and the complementarity of the CTC and attention modules. We then enhance this system with data augmentation techniques. In particular, we introduce an innovative method of simulating reading mistakes, which seeks to train the model to detect them better. It proves complementary to the noise data augmentation studied previously. Together, these two techniques allow the Transformer+CTC to greatly outperform the hybrid reference model, with a PER of 21.2%, and to improve the quality of its transcriptions on misreadings and in classroom babble noise.
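The abstract reports its results as PER values and relative improvements. As a rough illustration only, the sketch below shows how a phoneme error rate is commonly computed: the edit distance between a reference and a hypothesized phoneme sequence, divided by the reference length. The function names and the example phoneme sequences are hypothetical and are not taken from the thesis, whose exact scoring setup is not described here.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis phonemes.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction between two error rates."""
    return (before - after) / before


# Hypothetical example: one substituted phoneme out of four.
reference = ["ʃ", "a", "p", "o"]
hypothesis = ["ʃ", "a", "b", "o"]
print(phoneme_error_rate(reference, hypothesis))
```

Under this definition, moving from the 25.0% PER to the 21.2% PER quoted for the Transformer+CTC system corresponds to `relative_reduction(25.0, 21.2)`, that is, a relative reduction of about 15%.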
Contributor : ABES STAR
Submitted on : Wednesday, July 6, 2022 - 3:59:12 PM
Last modification on : Friday, July 8, 2022 - 4:15:04 AM


Version validated by the jury (STAR)


  • HAL Id : tel-03715653, version 1


Lucile Gelin. Reconnaissance automatique de la parole d'enfants apprenant·e·s lecteur·ice·s en salle de classe : modélisation acoustique de phonèmes. Artificial Intelligence [cs.AI]. Université Paul Sabatier - Toulouse III, 2022. French. ⟨NNT : 2022TOU30031⟩. ⟨tel-03715653⟩


