5th International Conference on Spoken Language Processing
We present a study of data simulated using acoustic models trained on Switchboard data and then recognized using various Switchboard-trained models. Simple development models give a word error rate (WER) of about 47% when recognizing real Switchboard conversations. If we simulate speech from word transcriptions, obtaining the word pronunciations from our recognition dictionary, the WER drops by a factor of five to ten. If we instead use more realistic hand-labeled phonetic transcripts to fabricate data, we obtain WERs in the low 40s, close to those found on actual speech data. These and other experiments described in the paper suggest that there is a substantial mismatch between real speech and the combination of our acoustic models and the pronunciations in our recognition dictionary. Simulation appears to be a promising tool in speech recognition research for understanding and reducing this mismatch.
Bibliographic reference. McAllaster, Don / Gillick, Lawrence / Scattone, Francesco / Newman, Michael (1998): "Fabricating conversational speech data with acoustic models: a program to examine model-data mismatch", in ICSLP-1998, paper 0986.