Adaptation thématique non supervisée d'un système de reconnaissance automatique de la parole

Gwénolé Lecorvé 1
1 TEXMEX - Multimedia content-based indexing
IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires, Inria Rennes – Bretagne Atlantique
Abstract : Current automatic speech recognition (ASR) systems rely on language models (LMs) that store word sequence probabilities (n-gram probabilities) and help the system select the most likely utterances. In practice, these n-gram probabilities are estimated once and for all on large multitopic corpora, using a fixed, though large, general-purpose vocabulary. Hence, current systems lack specificity when dealing with topic-specific spoken documents. To circumvent this problem, we propose to modify the LM and the vocabulary through a new unsupervised topic-based adaptation scheme. Starting solely from the automatic transcription of a thematically consistent broadcast segment, the process automatically retrieves topic-specific texts from the Internet, from which the LM probabilities are re-estimated and the vocabulary is enriched. A new transcription pass using these adapted components is then expected to improve recognition accuracy on the segment. This work is original in that it requires no a priori knowledge about the topics encountered and it integrates natural language processing techniques.
In addition, we contribute to each step of the adaptation process. First, given a first-pass automatic transcript of a segment, we adapt indexing methods from information retrieval, namely tf-idf, to the specifics of automatic transcription (no case, potentially erroneous words, etc.) in order to characterize the topic by a set of keywords. These keywords are submitted to Web search engines, and the retrieved Web pages are thematically filtered to guarantee good topic similarity with the transcript segment. Second, we developed an original topic-based LM re-estimation technique based on the minimum discrimination information LM adaptation framework and on topic-specific words and phrases automatically extracted from Web corpora. This lets us adapt only the LM n-gram probabilities related to the topic of the segment, while other, general-purpose n-gram probabilities are left untouched. Third, topic-specific Web corpora can be used to spot out-of-vocabulary topic-specific words to add to the ASR system vocabulary and LM. Whereas adding such words to the vocabulary is straightforward, integrating them into a pre-existing LM is more complex. We propose to do so by building n-grams for each new word from its paradigmatic relations with other words and from the combined information about how these other words are used in the pre-existing LM. Experiments on French broadcast news show that the whole topic-specific adaptation process yields significant improvements in the recognition accuracy of an ASR system.
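As an illustration of the keyword-extraction step mentioned in the abstract, the following is a minimal sketch of plain tf-idf ranking over a lowercased transcript segment. It is not the adapted criterion developed in the thesis, which also accounts for transcription errors. The function name `tfidf_keywords`, its parameters, the smoothed idf formula, and the use of a background document collection to estimate document frequencies are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_keywords(segment_tokens, background_docs, top_k=5):
    """Rank the words of one transcript segment by a plain tf-idf score.

    segment_tokens: lowercased tokens of the transcript segment.
    background_docs: list of token lists, used only to estimate
        document frequencies (hypothetical general-purpose collection).
    """
    n_docs = len(background_docs)

    # Document frequency: number of background documents containing each word.
    doc_freq = Counter()
    for doc in background_docs:
        doc_freq.update(set(doc))

    term_freq = Counter(segment_tokens)
    total = len(segment_tokens)

    scores = {}
    for word, count in term_freq.items():
        tf = count / total
        # Add-one smoothing keeps idf finite for words unseen in the
        # background documents; a word present in every document scores 0.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word]))
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

The returned keywords could then be joined into a query string for a search engine. A real system would likely also filter function words and down-weight words with low recognition confidence, which this sketch does not attempt.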
Document type :
Human-Computer Interaction [cs.HC]. INSA de Rennes, 2010. French
Contributor : Patrick Gros
Submitted on : Thursday, February 17, 2011 - 10:56:45 AM
Last modification on : Monday, May 18, 2015 - 1:10:38 AM


  • HAL Id : tel-00566824, version 1



Gwénolé Lecorvé. Adaptation thématique non supervisée d'un système de reconnaissance automatique de la parole. Human-Computer Interaction [cs.HC]. INSA de Rennes, 2010. French. <tel-00566824>



