Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning

Thomas Hueber; Eric Tatulli; Laurent Girin; Jean-Luc Schwartz

doi:10.1162/neco_a_01264

Journal article, Neural Computation, Year: 2020


Thomas Hueber

Function : Author
PersonId : 5965
IdHAL : thomas-hueber
ORCID : 0000-0002-8296-5177
IdRef : 143151568

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Eric Tatulli

Function : Author

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Laurent Girin

Function : Author
PersonId : 3682
IdHAL : laurent-girin
ORCID : 0000-0002-9214-8760
IdRef : 088998037

GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing

Jean-Luc Schwartz

Function : Author
PersonId : 1160
IdHAL : jean-luc-schwartz
ORCID : 0000-0001-8969-9185
IdRef : 033230374

GIPSA - Perception, Contrôle, Multimodalité et Dynamiques de la parole

Abstract

Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). Those predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). The predictions appear to be efficient in a short temporal range (25–50 ms), accounting for 50% to 75% of the variance of the incoming stimulus, which could potentially save up to three-quarters of the processing power. Prediction accuracy then decreases quickly and almost vanishes after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
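The evaluation metric named in the abstract, the fraction of the incoming signal's variance accounted for by a prediction made some delay ahead, can be illustrated with a toy sketch. Everything below is an assumption for illustration only: the signal is a synthetic first-order autoregressive sequence rather than speech features, the predictor is a single least-squares weight rather than the deep feedforward and convolutional networks used in the article, the horizons are in arbitrary frames rather than milliseconds, and the function names are hypothetical. The score is also computed on the same data used for fitting, which a real experiment would avoid.

```python
import random


def explained_variance(targets, predictions):
    """Return 1 - (squared error / total variance) of the targets."""
    mean_t = sum(targets) / len(targets)
    total = sum((t - mean_t) ** 2 for t in targets)
    residual = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return 1.0 - residual / total


def fit_one_tap_predictor(signal, horizon):
    """Least-squares weight w minimizing sum((x[t + horizon] - w * x[t]) ** 2)."""
    past = signal[:-horizon]
    future = signal[horizon:]
    num = sum(p * f for p, f in zip(past, future))
    den = sum(p * p for p in past)
    return num / den


def make_toy_signal(n, coeff=0.9, seed=0):
    """Synthetic first-order autoregressive sequence (a stand-in, not speech)."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(coeff * x[-1] + rng.gauss(0.0, 1.0))
    return x


def gain_by_horizon(signal, horizons):
    """Explained variance of the one-tap predictor at each horizon (in frames)."""
    gains = {}
    for h in horizons:
        w = fit_one_tap_predictor(signal, h)
        preds = [w * p for p in signal[:-h]]
        gains[h] = explained_variance(signal[h:], preds)
    return gains


if __name__ == "__main__":
    toy = make_toy_signal(20000)
    for h, g in gain_by_horizon(toy, [1, 2, 4, 8, 16]).items():
        print(f"horizon={h:2d} frames  explained variance={g:.3f}")
```

On such a toy signal the score falls as the horizon grows, which mirrors only qualitatively the decay with prediction delay reported in the abstract; the numbers themselves carry no meaning for speech.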

Domains

Artificial Intelligence [cs.AI]; Signal and Image processing; Linguistics; Machine Learning [stat.ML]

Main file

Hueber.pdf (846.34 KB)

Origin : Publisher files allowed on an open archive


https://hal.science/hal-03016083

Submitted on: Wednesday, November 25, 2020, 4:13:06 PM

Last modification on: Thursday, April 4, 2024, 9:12:43 PM

Long-term archiving on: Friday, February 26, 2021, 6:25:19 PM

Dates and versions

hal-03016083 , version 1 (25-11-2020)

Identifiers

HAL Id : hal-03016083 , version 1
DOI : 10.1162/neco_a_01264

Cite

Thomas Hueber, Eric Tatulli, Laurent Girin, Jean-Luc Schwartz. Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning. Neural Computation, 2020, 32 (3), pp.596-625. ⟨10.1162/neco_a_01264⟩. ⟨hal-03016083⟩


Collections

UGA CNRS GIPSA GIPSA-PCMD GIPSA-CRISSP GIPSA-PPC MIAI ANR
