In any dialogue manager, confidence scores play a central role in ensuring robust operation. Recently, dialogue managers have attempted to exploit N-best lists of alternative semantic interpretations rather than only the single most likely interpretation. Each alternative in the N-best list must have an associated confidence score, and it is useful to be able to evaluate the quality of these scored lists independently of the application in which they are used. This paper adapts several traditional metrics for confidence scoring to the context of the N-best semantic hypotheses output by a speech understanding system. An alternative metric, called the Item-level Cross Entropy (ICE), is proposed and is shown to have good theoretical and experimental characteristics. As an example of the use of the metrics, various simple methods for assigning confidences are discussed and evaluated. Of all the metrics tested, only the ICE metric provided a consistent monotonic ranking of the various systems.
Bibliographic reference. Thomson, B. / Yu, K. / Gašić, M. / Keizer, S. / Mairesse, F. / Schatzmann, J. / Young, S. (2008): "Evaluating semantic-level confidence scores with multiple hypotheses", in INTERSPEECH-2008, 1153-1156.
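The abstract does not reproduce the paper's exact ICE formulation, but the core idea of a cross-entropy-style score over individual semantic items can be sketched as follows. This is an illustrative sketch only: the function name, the (confidence, correct) item interface, and the averaging scheme are assumptions, not the authors' definition.

```python
import math

def item_cross_entropy(items):
    """Average negative log-likelihood of per-item confidence scores.

    `items` is a list of (confidence, correct) pairs, where `confidence`
    is the system's probability that a semantic item is correct and
    `correct` is the ground-truth label. Lower is better; a perfectly
    calibrated, fully confident system approaches 0.
    NOTE: this is an illustrative stand-in, not the ICE metric as
    defined in the paper.
    """
    eps = 1e-12  # clamp to avoid log(0) for over-confident scores
    total = 0.0
    for conf, correct in items:
        p = min(max(conf, eps), 1.0 - eps)
        total += -math.log(p) if correct else -math.log(1.0 - p)
    return total / len(items)

# Three hypothesised semantic items with confidences and correctness labels.
print(round(item_cross_entropy([(0.9, True), (0.2, False), (0.6, True)]), 4))
```

A score of this form rewards well-calibrated confidences: assigning high confidence to an incorrect item is penalised heavily, which is why a metric in this family can rank confidence-annotation schemes consistently.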