Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings

Prakhar Swarup, Roland Maas, Sri Garimella, Sri Harish Mallidi, Björn Hoffmeister

In automatic speech recognition (ASR), confidence measures provide a quantitative estimate of whether a generated hypothesis text is correct. For personal assistant devices like Alexa, ASR errors are inevitable due to the imperfection of today’s speech recognition technology. Confidence scores therefore provide an important metric for gauging the correctness of the ASR hypothesis text and enable downstream consumers to initiate appropriate actions. In this work, we aim to improve the correctness of our confidence scores by enhancing our baseline model architecture with learned features, namely acoustic and 1-best hypothesis embeddings. These embeddings are obtained by training separate networks on acoustic features and the ASR 1-best hypothesis, respectively. In an experimental evaluation on a large US English data set, incorporating these embeddings yields a 6% relative equal error rate reduction and a 13% relative normalized cross-entropy improvement over our baseline system. A deeper analysis of the embeddings reveals that the acoustic embedding leads to better prediction of insertion errors, whereas the 1-best hypothesis embedding helps to better predict substitution errors.
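The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions: normalized cross-entropy (NCE) compares the confidence scores against a constant score equal to the prior probability of a correct word, and the equal error rate (EER) is the operating point at which the false-accept and false-reject rates coincide. The abstract does not spell out the paper's exact evaluation setup, so the function names, the binary labels (1 = correct, 0 = incorrect), and the threshold sweep below are illustrative assumptions rather than the authors' implementation.

```python
import math


def normalized_cross_entropy(confidences, labels, eps=1e-12):
    """NCE = (H_max - H_conf) / H_max, assuming labels are 1 (correct) or 0 (incorrect)."""
    n = len(labels)
    n_correct = sum(labels)
    if n_correct == 0 or n_correct == n:
        raise ValueError("NCE is undefined when all labels are identical")
    p_correct = n_correct / n
    # Entropy of a baseline that always outputs the prior p_correct.
    h_max = -(n_correct * math.log(p_correct)
              + (n - n_correct) * math.log(1.0 - p_correct))
    # Cross-entropy of the actual confidence scores.
    h_conf = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1.0 - eps)  # clip to avoid log(0)
        h_conf -= math.log(c) if y else math.log(1.0 - c)
    return (h_max - h_conf) / h_max


def equal_error_rate(confidences, labels):
    """Sweep thresholds and return the error rate where FAR and FRR are closest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("EER needs both correct and incorrect examples")
    best_gap, eer = math.inf, None
    # A hypothesis is accepted when its confidence is >= the threshold.
    for t in sorted(set(confidences)) + [math.inf]:
        false_accepts = sum(1 for c, y in zip(confidences, labels) if c >= t and not y)
        false_rejects = sum(1 for c, y in zip(confidences, labels) if c < t and y)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

Under these definitions, a higher NCE and a lower EER both indicate better confidence scores. Scores that carry no information beyond the prior give an NCE of zero, and scores that separate correct from incorrect hypotheses perfectly give an EER of zero.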

DOI: 10.21437/Interspeech.2019-1241

Cite as: Swarup, P., Maas, R., Garimella, S., Mallidi, S.H., Hoffmeister, B. (2019) Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings. Proc. Interspeech 2019, 2175-2179, DOI: 10.21437/Interspeech.2019-1241.

  author={Prakhar Swarup and Roland Maas and Sri Garimella and Sri Harish Mallidi and Björn Hoffmeister},
  title={{Improving ASR Confidence Scores for Alexa Using Acoustic and Hypothesis Embeddings}},
  booktitle={Proc. Interspeech 2019},