ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium
September 18-20, 2000
This paper proposes new phone-duration-based features for classifier-based confidence measures (CMs). In misrecognized utterances, the segmentation, and thus the phoneme durations, often deviate severely from what is observed in the training data. In addition, the segment found for one recognized phoneme often covers several 'real' phonemes that have different spectral properties. Such phoneme durations therefore often indicate that a misrecognition has taken place, and we derive new features from them. In addition to these new features, we use features related to the acoustic scores of the N-best hypotheses. Using the full set of 46 features, we achieve a correct classification rate of 90% at a false rejection rate of 5.1% on an isolated-word command-and-control task with a rather simple neural network (NN) classifier. With the same approach we simultaneously try to detect out-of-vocabulary (OOV) words and succeed in 91% of the cases. We then combine this CM with unsupervised MAP and MLLR speaker adaptation: the adaptation is guided by the CM, and the acoustic models are modified only if the utterance was recognized with high confidence.
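The two ideas in the abstract, duration deviation as a misrecognition cue and confidence-gated adaptation, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `DurationStats`, `duration_features`, `adapt_if_confident`, the two z-score features, and the 0.5 threshold are hypothetical and are not the paper's 46 features, its NN classifier, or its MAP/MLLR procedure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class DurationStats:
    """Per-phoneme duration statistics from training data (hypothetical, in frames)."""
    mean: float
    std: float


def duration_features(
    segmentation: Sequence[Tuple[str, int]],
    stats: Dict[str, DurationStats],
) -> List[float]:
    """Return [max |z|, mean |z|] over the phoneme durations of one hypothesis.

    `segmentation` holds (phoneme, duration in frames) pairs from the recognizer.
    Large values mean the durations deviate strongly from the training data.
    """
    z_scores = []
    for phoneme, duration in segmentation:
        s = stats[phoneme]
        # Guard against a zero standard deviation.
        z_scores.append(abs(duration - s.mean) / max(s.std, 1e-6))
    if not z_scores:
        return [0.0, 0.0]
    return [max(z_scores), sum(z_scores) / len(z_scores)]


def adapt_if_confident(
    features: Sequence[float],
    classifier: Callable[[Sequence[float]], float],
    adapt: Callable[[], None],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> bool:
    """Run `adapt` only when the classifier's confidence reaches `threshold`."""
    if classifier(features) >= threshold:
        adapt()
        return True
    return False
```

In this sketch, `classifier` stands in for any trained model that maps a feature vector to a confidence in [0, 1], and `adapt` stands in for one unsupervised adaptation step on the acoustic models. An utterance rejected by the confidence measure leaves the models untouched.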
Bibliographic reference. Goronzy, Silke / Marasek, Krzysztof / Haag, Andreas / Kompe, Ralf (2000): "Prosodically motivated features for confidence measures", In ASR-2000, 207-212.