Pronunciation and disfluency modeling for expressive speech synthesis

Abstract: In numerous domains, the use of synthetic speech is conditioned on the ability of speech synthesis systems to produce natural and expressive speech. In this context, we address the problem of expressivity in text-to-speech (TTS) synthesis by incorporating two phenomena with a high impact on speech: pronunciation variants and speech disfluencies.

In the first part of this thesis, we present a new pronunciation variant generation method which adapts standard, i.e., dictionary-based, pronunciations to a spontaneous style. Its strength and originality lie in exploiting a wide range of linguistic, articulatory and acoustic features, and in using a probabilistic machine learning framework, namely conditional random fields (CRFs) combined with language models. Extensive experiments on the Buckeye corpus demonstrate the effectiveness of this approach through objective and subjective evaluations. Listening tests on synthetic speech show that adapted pronunciations are judged as more spontaneous than standard ones, and even than those realized by real speakers. Furthermore, we show that the method can be extended to other adaptation tasks, for instance to solve the problem of inconsistency between the phoneme sequences handled in TTS systems.

The second part of this thesis explores a novel approach to the automatic generation of speech disfluencies for TTS. Speech disfluencies are among the most pervasive phenomena in spontaneous speech, so being able to generate them automatically is crucial for more expressive synthetic speech. The proposed approach has the advantage of generating several types of disfluencies: pauses, repetitions and revisions. To achieve this, we formalize the problem as a theoretical process in which transformation functions are iteratively composed. We present a first implementation of this process using CRFs and language models, before conducting objective and perceptual evaluations. These experiments lead to the conclusion that our proposal is effective at generating disfluencies, and they highlight perspectives for future improvements.
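To make the first part concrete, here is a minimal sketch of CRF-based pronunciation adaptation in the spirit of the thesis: canonical (dictionary) phonemes in, spontaneous realizations out. It uses the sklearn-crfsuite library; the toy data, the "_" deletion symbol and the feature set are illustrative assumptions, not the thesis's actual features or corpus.

```python
# Sketch: adapt canonical pronunciations to a spontaneous style with a CRF.
# Toy data and features only; real training data would come from aligned
# corpora such as Buckeye, with richer articulatory/acoustic features.
import sklearn_crfsuite

def phoneme_features(seq, i):
    """Features for the i-th canonical phoneme: identity plus left/right context."""
    return {"phone": seq[i],
            "prev": seq[i - 1] if i > 0 else "<s>",
            "next": seq[i + 1] if i < len(seq) - 1 else "</s>"}

def featurize(seq):
    return [phoneme_features(seq, i) for i in range(len(seq))]

# Parallel data: canonical pronunciation -> observed spontaneous realization.
# "_" marks a deleted phoneme (e.g. /t/ dropping in fast speech).
canonical = [["p", "r", "ah", "b", "ah", "b", "l", "iy"],   # "probably"
             ["w", "ah", "n", "t"]]                          # "want"
realized  = [["p", "r", "aa", "_", "_", "b", "l", "iy"],     # "prolly"
             ["w", "ah", "n", "_"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit([featurize(s) for s in canonical], realized)

# Adapt a new canonical pronunciation to a spontaneous style.
print(crf.predict([featurize(["p", "r", "ah", "b", "ah", "b", "l", "iy"])]))
```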
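For the second part, the following sketch illustrates the compositional view of disfluency generation: each transformation maps a fluent word sequence to a (possibly) disfluent one, and transformations are applied iteratively. The insertion rules here are hard-coded placeholders for illustration; the thesis learns the insertion decisions with CRFs and language models.

```python
# Sketch: disfluency generation as iterated composition of transformations.
from functools import reduce

def insert_pause(words):
    # Hypothetical rule: insert a filled pause before the longest word.
    i = max(range(len(words)), key=lambda k: len(words[k]))
    return words[:i] + ["uh"] + words[i:]

def insert_repetition(words):
    # Hypothetical rule: repeat the first word.
    return words[:1] + words

def compose(transforms, words):
    # Apply f_1, then f_2, ..., i.e. f_n(... f_1(words)).
    return reduce(lambda seq, f: f(seq), transforms, words)

fluent = ["i", "want", "to", "reschedule", "the", "meeting"]
print(" ".join(compose([insert_pause, insert_repetition], fluent)))
# -> "i i want to uh reschedule the meeting"
```

Note that the order of composition matters: repeating first and pausing afterwards would yield a different disfluent sequence, which is why the process is formalized as an ordered, iterative composition.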

https://hal.inria.fr/tel-01668014

File

QADER_Raheel.pdf (version validated by the jury, STAR)

Identifiers

  • HAL Id: tel-01668014, version 2

Citation

Raheel Qader. Pronunciation and disfluency modeling for expressive speech synthesis. Artificial Intelligence [cs.AI]. Université Rennes 1, 2017. English. ⟨NNT : 2017REN1S076⟩. ⟨tel-01668014v2⟩
