Abstract: Current automatic speech recognition (ASR) systems are based on language models (LMs) that gather word sequence probabilities (n-gram probabilities) and help the system select the utterances with the highest likelihood. In practice, these n-gram probabilities are estimated once and for all on large multitopic corpora based on a fixed, though large, general-purpose vocabulary. Hence, current systems suffer from a lack of specificity when dealing with topic-specific spoken documents. To circumvent this problem, we propose to modify the LM and the vocabulary through a new unsupervised topic-based adaptation scheme. Based solely on the automatic transcription of a thematically consistent broadcast segment, the process consists of automatically retrieving topic-specific texts from the Internet, from which the LM probabilities are re-estimated and the vocabulary is enriched. A new transcription pass using these adapted components is then expected to improve the recognition accuracy of the segment. This work is original in that it avoids using any a priori knowledge about the encountered topics and it integrates natural language processing techniques. In addition, we made contributions to each step of the adaptation process. First, given a first-pass automatic transcript segment, we propose to adapt indexing methods from the information retrieval domain, namely tf-idf, to the specifics of automatic transcription (no case, potentially erroneous words, etc.) in order to characterize the encountered topic by a set of keywords. These keywords are submitted to Web search engines, and the retrieved Web pages are thematically filtered to guarantee a good topic similarity with the transcript segment. Second, we developed an original topic-based LM re-estimation technique based on the minimum discrimination information LM adaptation framework and on topic-specific words and phrases automatically extracted from Web corpora.
This enables us to adapt only the LM n-gram probabilities related to the topic of the segment, while the other, general-purpose n-gram probabilities are left untouched. Third, topic-specific Web corpora can be used to spot out-of-vocabulary topic-specific words to be added to the ASR system vocabulary and LM. Whereas adding such words to the vocabulary is straightforward, integrating them into a pre-existing LM is more complex. We thus proposed to achieve this task by building n-grams for each new word from its paradigmatic relations with other words and from the combined information about how those words are used in the pre-existing LM. Experiments on French-language broadcast news show that our whole topic-specific adaptation process yields significant improvements in the recognition accuracy of an ASR system.
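To make the keyword-selection step more concrete, the following is a minimal sketch of plain tf-idf ranking applied to a lowercased transcript segment. It is not the adapted criterion developed in this work: the function name `tfidf_keywords`, the toy background collection, and the smoothed idf formula are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_keywords(segment_tokens, background_docs, top_k=5):
    """Rank the words of one transcript segment by a basic tf-idf score.

    Hypothetical helper for illustration only. `segment_tokens` is a list of
    lowercased words; `background_docs` is a list of token lists standing in
    for a general-purpose reference collection.
    """
    n_docs = len(background_docs)

    # Document frequency: number of background documents containing each word.
    df = Counter()
    for doc in background_docs:
        df.update(set(doc))

    tf = Counter(segment_tokens)
    total = len(segment_tokens)

    scores = {}
    for word, count in tf.items():
        # Smoothed idf (an assumption): words present in every background
        # document score 0, unseen words get the largest finite value.
        idf = math.log((1 + n_docs) / (1 + df[word]))
        scores[word] = (count / total) * idf

    # Sort by decreasing score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Toy data, invented for this example.
    background = [
        "the president said the vote is today".split(),
        "the match ended and the team won".split(),
        "the weather is cold today".split(),
    ]
    segment = "the volcano eruption closed the airport and the volcano ash spread".split()
    print(tfidf_keywords(segment, background, top_k=3))
```

In this sketch, frequent function words are pushed down by their low idf, while repeated content words rise to the top. A real first-pass transcript would additionally call for the robustness to recognition errors mentioned above, which this toy version does not attempt.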
Human-Computer Interaction [cs.HC]. INSA de Rennes, 2010. French
