Confidence scores are useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural networks to learn word or utterance confidence scores for end-to-end ASR. In those studies, word confidence by itself does not model deletions, and utterance confidence does not take advantage of word-level training signals. This paper proposes to jointly learn word confidence, word deletion, and utterance confidence. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without increasing the model size of the confidence estimation module. Using the utterance-level confidence for rescoring also decreases the word error rates on Google’s Voice Search and Long-tail Maps datasets by 3–5% relative, without requiring a dedicated neural rescorer.
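For readers unfamiliar with the first metric named above, the following is a minimal sketch of normalized cross entropy (NCE) in its commonly used form: the relative reduction in cross entropy over a baseline that assigns every word the empirical accuracy. It is not taken from the paper, whose exact evaluation setup may differ. The function name `normalized_cross_entropy` and the clipping constant `eps` are assumptions made for illustration.

```python
import math


def normalized_cross_entropy(confidences, labels, eps=1e-12):
    """Return NCE for predicted confidences against binary labels (1 = correct word).

    Hypothetical helper for illustration; `eps` only guards against log(0).
    """
    n = len(labels)
    if n == 0 or n != len(confidences):
        raise ValueError("confidences and labels must be non-empty and equal in length")

    p_correct = sum(labels) / n
    if p_correct in (0.0, 1.0):
        raise ValueError("NCE is undefined when all labels are identical")

    # Entropy of the baseline that predicts the empirical accuracy for every word.
    h_base = -(p_correct * math.log(p_correct)
               + (1.0 - p_correct) * math.log(1.0 - p_correct))

    # Average cross entropy of the predicted confidences.
    h_model = 0.0
    for c, y in zip(confidences, labels):
        c = min(max(c, eps), 1.0 - eps)
        h_model -= math.log(c) if y else math.log(1.0 - c)
    h_model /= n

    return (h_base - h_model) / h_base
```

Under this definition, a score near 1 indicates confidences that closely match word correctness, 0 matches the constant-accuracy baseline, and negative values indicate confidences that are worse than that baseline.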
Cite as: Qiu, D., He, Y., Li, Q., Zhang, Y., Cao, L., McGraw, I. (2021) Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction. Proc. Interspeech 2021, 4074-4078, doi: 10.21437/Interspeech.2021-1207
@inproceedings{qiu21b_interspeech,
  author={David Qiu and Yanzhang He and Qiujia Li and Yu Zhang and Liangliang Cao and Ian McGraw},
  title={{Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4074--4078},
  doi={10.21437/Interspeech.2021-1207}
}