ISCA Archive IberSPEECH 2022

Contextual-Utterance Training for Automatic Speech Recognition

Alejandro Gomez-Alanis, Lukas Drude, Andreas Schwarz, Rupak Vignesh Swaminathan, Simon Wiesler

Recent studies of streaming automatic speech recognition (ASR) systems based on recurrent neural network transducers (RNN-T) have fed the encoder with past contextual information in order to improve word error rate (WER). In this paper, we first propose a contextual-utterance training technique which uses the previous and future contextual utterances to perform an implicit adaptation to the speaker, topic, and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach makes better use of the available acoustic context in streaming models by distilling “in-place” the knowledge of a teacher (non-streaming mode), which sees both past and future contextual utterances, into a student (streaming mode), which sees only the current and past contextual utterances. Experimental results show that a state-of-the-art conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
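The dual-mode idea can be illustrated with a minimal sketch of an "in-place" distillation objective: a teacher (non-streaming) pass and a student (streaming) pass of the same model produce output distributions, and a KL term pulls the student toward the teacher on top of the usual transducer loss. This is not the authors' implementation; the function names, the `kd_weight` parameter, and the use of a precomputed scalar `asr_loss` are illustrative assumptions, and the actual RNN-T loss is omitted for brevity.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), summed over the vocabulary axis and averaged over frames.
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def dual_mode_distillation_loss(asr_loss, teacher_logits, student_logits, kd_weight=0.5):
    # Teacher pass (non-streaming mode) sees past and future contextual
    # utterances; student pass (streaming mode) sees only current and past
    # context. The teacher distribution is treated as a fixed target.
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return asr_loss + kd_weight * kl_divergence(teacher_probs, student_probs)
```

In a real system both passes share the same parameters, so minimizing the KL term distills the full-context behavior into the streaming mode of the very model being trained, rather than into a separate student network.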

doi: 10.21437/IberSPEECH.2022-6

Cite as: Gomez-Alanis, A., Drude, L., Schwarz, A., Swaminathan, R.V., Wiesler, S. (2022) Contextual-Utterance Training for Automatic Speech Recognition. Proc. IberSPEECH 2022, 26-30. doi: 10.21437/IberSPEECH.2022-6

@inproceedings{gomezalanis22_iberspeech,
  author={Alejandro Gomez-Alanis and Lukas Drude and Andreas Schwarz and Rupak Vignesh Swaminathan and Simon Wiesler},
  title={{Contextual-Utterance Training for Automatic Speech Recognition}},
  year={2022},
  booktitle={Proc. IberSPEECH 2022},
  pages={26--30},
  doi={10.21437/IberSPEECH.2022-6}
}