Third International Conference on Spoken Language Processing (ICSLP 94)

Yokohama, Japan
September 18-22, 1994

Estimating Recognition Confidence: Methods for Conjoining Acoustics, Semantics, Pragmatics and Discourse

Sheryl R. Young

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

This paper describes and evaluates a new technique for measuring confidence in word strings produced by speech recognition systems. It detects misrecognized and out-of-vocabulary words in spontaneous spoken utterances and dialogs using multiple stochastic and symbolic knowledge sources, including acoustics, semantics, pragmatics and discourse structure. The work is part of a larger effort to automatically recognize and understand new words when spoken. The system described combines newly developed acoustic confidence measures with the semantic, pragmatic and discourse structure knowledge embodied in the MINDS-II system. The acoustic confidence metrics output independent probabilities that a word is recognized correctly and measure how reliably we can estimate whether a word is wrong. They are derived from normalized acoustic recognition scores, where acoustic scores are normalized by an estimate of the denominator of Bayes' equation. To evaluate the utility of using the acoustic techniques together with higher-level constraints, the preliminary system restricted component interaction: words whose normalized acoustic scores indicated a 95% or greater probability of being incorrect were flagged prior to being input to the MINDS-II analysis module. For this study, MINDS-II independently used its higher-level knowledge to detect recognition errors that were semantically or contextually inappropriate. Misrecognized word strings were then re-recognized using an RTN-based speech decoder and a dynamically derived language model that biases against recognition of illogical and highly improbable content. The dynamically derived grammars restrict the words that can be matched during recognition, reducing perplexity by defining a set of semantic content predictions for the word string. A grammar is derived for each misrecognized word string encountered within an utterance.
Speaker goals and plans, contextual appropriateness, and structural characteristics of discourse and spontaneous speech are all considered in the derivation of grammars. The results indicate that the conjoined use of acoustic confidence measures and higher-level constraints increased the ability to detect misrecognitions by 36% and enabled the larger system to overcome the weaknesses of the individual techniques. The two techniques detect complementary phenomena. The acoustic methods detect important misrecognized content words, but cannot reliably estimate recognition accuracy for most small or confusable words. The higher-level constraint methods cannot detect contextually consistent misrecognitions, but can detect errors caused by confusable content words, restarts and mid-utterance corrections. Current work focuses on developing more sophisticated techniques for conjoining these two methods and on techniques for using acoustic confidence measures during decoding.
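The score normalization and flagging criterion described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch only, not the paper's implementation: the denominator estimate (a per-frame log-sum-exp over competing acoustic models) and the logistic calibration mapping normalized scores to probabilities are assumptions standing in for details the abstract does not specify.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of log-likelihoods."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def normalized_score(word_loglik, frame_competitor_logliks):
    """Normalize a word's acoustic log-likelihood, log p(x|w), by an
    estimate of the Bayes denominator log p(x). Here p(x) is approximated
    frame-by-frame by summing over competing model likelihoods (one common
    estimator; the paper's exact denominator estimate is not given)."""
    log_px = sum(logsumexp(frame) for frame in frame_competitor_logliks)
    return word_loglik - log_px  # posterior-like log confidence

def p_correct(norm_score, a=1.0, b=0.0):
    """Hypothetical logistic calibration turning the normalized score into
    an independent probability that the word was recognized correctly."""
    return 1.0 / (1.0 + math.exp(-(a * norm_score + b)))

def flag_word(norm_score, threshold=0.95):
    """Flag a word for higher-level re-analysis when its estimated
    probability of being incorrect meets the 95% criterion."""
    return (1.0 - p_correct(norm_score)) >= threshold
```

In the evaluated configuration, words flagged this way would be marked before the word string is passed to the higher-level analysis module, which then applies its semantic and pragmatic constraints independently.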

Full Paper

Bibliographic reference. Young, Sheryl R. (1994): "Estimating recognition confidence: methods for conjoining acoustics, semantics, pragmatics and discourse", In ICSLP-1994, 2159-2162.