
Acoustic-Visual Speech Synthesis by Bimodal Unit Selection

Utpala Musti 1 
1 PAROLE - Analysis, perception and recognition of speech
Inria Nancy - Grand Est, LORIA - NLPKD - Department of Natural Language Processing & Knowledge Discovery
Abstract : This work deals with audio-visual speech synthesis. Much of the extensive literature on the topic splits it into two separate synthesis problems: acoustic speech synthesis and the generation of the corresponding facial animation. This split does not guarantee perfectly synchronous and coherent audio-visual speech. To overcome this drawback implicitly, we propose a different approach: acoustic-visual speech synthesis by selecting naturally synchronous bimodal units. The synthesis is based on the classical unit selection paradigm. The main idea behind the technique is to keep the natural association between the acoustic and visual modalities intact. We describe the audio-visual corpus acquisition technique and the database preparation for our system. We present an overview of the system and detail the aspects of bimodal unit selection that need to be optimized for good synthesis. The main focus of this work is to synthesize the speech dynamics well rather than to build a comprehensive talking head. We describe the visual target features that we designed. We then present an algorithm that we developed for target feature weighting, which iteratively performs target feature weighting and redundant feature elimination. It is based on comparing the ranking given by the target cost with a distance calculated from the acoustic and visual speech signals of units in the corpus. Finally, we present the perceptual and subjective evaluation of the final synthesis system. The results show that we have achieved the goal of synthesizing the speech dynamics reasonably well.
Contributor : Slim Ouni
Submitted on : Saturday, January 11, 2014 - 12:27:06 AM
Last modification on : Saturday, June 25, 2022 - 7:41:40 PM
Long-term archiving on : Friday, April 11, 2014 - 10:25:12 PM



  • HAL Id : tel-01749331, version 2


Utpala Musti. Acoustic-Visual Speech Synthesis by Bimodal Unit Selection. Machine Learning [cs.LG]. Université de Lorraine, 2013. English. ⟨NNT : 2013LORR0003⟩. ⟨tel-01749331v2⟩


