On temporal constraints for deep neural voice alignment

Yann Teytaut

Abstract

To listen, to respond, to make coincide, to coordinate, to adjust, to follow, to adapt, to be in unison, to synchronize, to align... The rich vocabulary devoted to matching human activities in time shows how important their temporal organization is. Human communication, multimodal by nature, is directly affected by this issue, since a semantic gap exists between spoken utterances and their symbolic sequences: how should a written message be interpreted without vocal intonation? What performance style lies beyond a fixed musical score? This thesis sets out to uncover and explain the complex relationships between the audio and symbolic domains, in order to reduce this gap through a fine-grained study of the temporality inherent in voice recordings. The voice alignment task lies at the core of this objective: it aims to determine when symbols that are assumed to be present in a voice signal occur in time. This work focuses in particular on the development of an acoustic model, ADAGIO, capable of estimating such time-symbol links. Recent progress in deep learning has led to implementing ADAGIO as a deep neural network within a powerful generic formalism, "Connectionist Temporal Classification" (CTC). However, the great flexibility offered by CTC is undermined by its intrinsic lack of guarantees that predictions are temporally accurate. The key contributions of this research therefore consist in reinforcing CTC with additional temporal constraints to improve the quality of the inferred alignments. To this end, three ancillary tasks are introduced: (1) spectral content reconstruction, (2) audio structure propagation, and (3) guided monotony. These tasks have a positive impact on the alignment between voices, texts, and notes. ADAGIO has in turn contributed, through collaborations, to many practical applications, such as concatenative speech synthesis and the study of the expressive production strategies at play both for social attitudes in speech and for singing style in musical performances.
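As an illustration of what a CTC-based alignment involves, the sketch below runs a standard Viterbi forced alignment over a small table of frame-wise log-probabilities. It is not ADAGIO and includes none of the additional temporal constraints proposed in the thesis; it only shows the unconstrained CTC baseline that those constraints are meant to reinforce. The function names (`ctc_forced_align`, `to_segments`), the three-symbol vocabulary, and the toy posteriors are all assumptions made up for this example.

```python
import math

NEG_INF = float("-inf")


def ctc_forced_align(log_probs, target, blank=0):
    """Most likely frame-wise label path that collapses to `target` under CTC.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    non-empty list of symbol ids (no blanks).
    """
    if not target:
        raise ValueError("target must be non-empty")
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for sym in target:
        ext += [sym, blank]
    T, S = len(log_probs), len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start on the leading blank or on the first symbol
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            best = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][best] > NEG_INF:
                score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
                back[t][s] = best

    # A valid path ends on the last symbol or on the trailing blank
    s = max((S - 1, S - 2), key=lambda p: score[T - 1][p])
    if score[T - 1][s] == NEG_INF:
        raise ValueError("not enough frames to align the target")
    path = []
    for t in range(T - 1, -1, -1):
        path.append(ext[s])
        s = back[t][s]
    return path[::-1]


def to_segments(path, blank=0):
    """Turn a frame-wise path into (symbol, start_frame, end_frame) tuples."""
    segments = []
    for frame, sym in enumerate(path):
        if sym == blank:
            continue
        if segments and segments[-1][0] == sym and segments[-1][2] == frame:
            segments[-1][2] = frame + 1  # same symbol continues
        else:
            segments.append([sym, frame, frame + 1])
    return [tuple(seg) for seg in segments]


# Toy posteriors (made up): 5 frames, vocabulary {0: blank, 1, 2}
toy_probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in toy_probs]
toy_path = ctc_forced_align(toy_log_probs, [1, 2])
toy_segments = to_segments(toy_path)
print(toy_path)      # [1, 1, 0, 2, 2]
print(toy_segments)  # [(1, 0, 2), (2, 3, 5)]
```

In this sketch, nothing ties the position of a symbol's frames to the acoustic event itself: any path that collapses to the target is admissible, and the best-scoring one wins. This is the lack of temporal guarantees that the abstract refers to.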

