Skip to Main content Skip to Navigation

Contrôle de têtes parlantes par inversion acoustico-articulatoire pour l’apprentissage et la réhabilitation du langage

Abstract : Speech sounds may be complemented by displaying speech articulators shapes on a computer screen, hence producing augmented speech, a signal that is potentially useful in all instances where the sound itself might be difficult to understand, for physical or perceptual reasons. In this thesis, we introduce a system called visual articulatory feedback, in which the visible and hidden articulators of a talking head are controlled from the speaker's speech sound. The motivation of this research was to develop such a system that could be applied to Computer Aided Pronunciation Training (CAPT) for learning of foreign languages, or in the domain of speech therapy. We have based our approach to this mapping problem on statistical models build from acoustic and articulatory data. In this thesis we have developed and evaluated two statistical learning methods trained on parallel synchronous acoustic and articulatory data recorded on a French speaker by means of an electromagnetic articulograph. Our Hidden Markov models (HMMs) approach combines HMM-based acoustic recognition and HMM-based articulatory synthesis techniques to estimate the articulatory trajectories from the acoustic signal. Gaussian mixture models (GMMs) estimate articulatory features directly from the acoustic ones. We have based our evaluation of the improvement results brought to these models on several criteria: the Root Mean Square Error between the original and recovered EMA coordinates, the Pearson Product-Moment Correlation Coefficient, displays of the articulatory spaces and articulatory trajectories, as well as some acoustic or articulatory recognition rates. Experiments indicate that the use of states tying and multi-Gaussian per state in the acoustic HMM improves the recognition stage, and that the minimum generation error (MGE) articulatory HMMs parameter updating results in a more accurate inversion than the conventional maximum likelihood estimation (MLE) training. In addition, the GMM mapping using MLE criteria is more efficient than using minimum mean square error (MMSE) criteria. In conclusion, we have found that the HMM inversion system has a greater accuracy compared with the GMM one. Beside, experiments using the same statistical methods and data have shown that the face-to-tongue inversion problem, i.e. predicting tongue shapes from face and lip shapes cannot be solved in a general way, and that it is impossible for some phonetic classes. In order to extend our system based on a single speaker to a multi-speaker speech inversion system, we have implemented a speaker adaptation method based on the maximum likelihood linear regression (MLLR). In MLLR, a linear regression-based transform that adapts the original acoustic HMMs to those of the new speaker was calculated to maximise the likelihood of adaptation data. Finally, this speaker adaptation stage has been evaluated using an articulatory phonetic recognition system, as there are not original articulatory data available for the new speakers. Finally, using this adaptation procedure, we have developed a complete articulatory feedback demonstrator, which can work for any speaker. This system should be assessed by perceptual tests in realistic conditions.
Document type :
Complete list of metadatas

Cited literature [182 references]  Display  Hide  Download
Contributor : Abes Star :  Contact
Submitted on : Tuesday, July 31, 2012 - 10:52:40 AM
Last modification on : Thursday, November 19, 2020 - 12:59:59 PM
Long-term archiving on: : Thursday, November 1, 2012 - 2:30:44 AM


Version validated by the jury (STAR)


  • HAL Id : tel-00721957, version 1


Atef Ben Youssef. Contrôle de têtes parlantes par inversion acoustico-articulatoire pour l’apprentissage et la réhabilitation du langage. Autre. Université de Grenoble, 2011. Français. ⟨NNT : 2011GRENT088⟩. ⟨tel-00721957⟩



Record views


Files downloads