Sixth International Conference on Spoken Language Processing
To detect misrecognitions that may result from a mismatch between training and testing data, we use a confidence measure (CM) that collects a set of features during recognition and from the N-best list output by the recognizer. A neural network (NN) then calculates, based on these features, the probability that the utterance was recognized correctly. Since the phoneme alignments of misrecognized utterances are often erroneous, we introduced some new features that are based on phoneme durations. The durations found by the recognizer are compared to the durations observed in the training database, and the results of these comparisons serve as input for the NN. A great advantage of the duration-related features is that they are independent of the recognizer, in contrast to, e.g., acoustic score-based features. We also use some score-related features that have proven useful in the past. While determining the confidence for a recognition result, we also try to detect whether, in the case of a misrecognition, the utterance was an out-of-vocabulary (OOV) utterance. Using the complete set of 46 features, we achieve a correct classification rate of 90%. The word error rate can be reduced by 92% at a false rejection rate of 5.1% on a test task that consists of 35 speakers and includes more than 50% OOV utterances. OOV words were detected correctly in 91% of the cases. The presented CM is also used in a semi-supervised speaker adaptation scheme.
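The comparison of recognized phoneme durations against training-data durations can be illustrated with a minimal sketch. The abstract does not specify the individual features, so everything below is an assumption made for illustration only: the statistics table `TRAIN_STATS`, the function `duration_features`, the z-score comparison, and the threshold `z_limit` are hypothetical and are not the authors' 46 features.

```python
from statistics import mean

# Hypothetical per-phoneme duration statistics (in frames), as they might be
# estimated from training alignments: phoneme -> (mean, standard deviation).
TRAIN_STATS = {
    "a": (9.0, 3.0),
    "t": (5.0, 2.0),
    "s": (8.0, 2.0),
}


def duration_features(alignment, stats, z_limit=2.0):
    """Summarize how far recognized phoneme durations deviate from training.

    alignment: list of (phoneme, duration_in_frames) from the recognizer.
    stats:     phoneme -> (mean, std) of durations in the training data.
    z_limit:   assumed threshold above which a duration counts as an outlier.
    """
    z_scores = []
    for phoneme, duration in alignment:
        if phoneme not in stats:
            continue  # no training statistics for this phoneme; skip it
        mu, sigma = stats[phoneme]
        z_scores.append(abs(duration - mu) / sigma)

    if not z_scores:
        return {"mean_abs_z": 0.0, "max_abs_z": 0.0, "frac_outliers": 0.0}

    return {
        "mean_abs_z": mean(z_scores),
        "max_abs_z": max(z_scores),
        "frac_outliers": sum(z > z_limit for z in z_scores) / len(z_scores),
    }


# A plausible alignment versus one with implausibly short and long phonemes.
plausible = duration_features([("s", 8), ("a", 10), ("t", 5)], TRAIN_STATS)
suspicious = duration_features([("s", 1), ("a", 30), ("t", 5)], TRAIN_STATS)
```

In a setup like the one described, values of this kind would be concatenated with the score-related features and passed to the NN; they depend only on the alignment and the training-duration statistics, not on the recognizer's acoustic scores.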
Bibliographic reference. Goronzy, Silke / Marasek, Krzysztof / Kompe, Ralf / Haag, Andreas (2000): "Phone-duration-based confidence measures for embedded applications", In ICSLP-2000, vol.4, 500-503.