In the context of Large-Vocabulary Continuous Speech Recognition (LVCSR), systems can reach a high level of performance on prepared speech, while their performance drops on spontaneous speech. This decrease is due to the fact that these two kinds of speech are marked by strong acoustic and linguistic differences. Previous research has sought to detect and repair some peculiarities of spontaneous speech, such as disfluencies, and to create specific models to improve recognition accuracy; however, a large amount of data, which is expensive to collect, is needed to obtain improvements. In this paper, we present a solution to create specialized acoustic and language models by automatically extracting, from the initial training corpus, a data subset containing spontaneous speech, and adapting the initial acoustic and language models on it. As we assume these models can be complementary, we propose to combine the outputs of the general and adapted ASR systems. Experimental results show statistically significant gains, for a negligible cost (no additional training data and no human intervention).
Bibliographic reference. Dufour, Richard / Bougares, Fethi / Estève, Yannick / Deléglise, Paul (2010): "Unsupervised model adaptation on targeted speech segments for LVCSR system combination", in Proceedings of INTERSPEECH 2010, 885-888.