ISCA Archive Interspeech 2021

Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models

Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao

Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time; their minimal latency makes them well suited to such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal, with no future context, and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances and then train a streaming student on the teacher's predictions. However, the performance gap between teacher and student WERs remains large. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them with Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon the streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.
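The core of the ROVER combination mentioned in the abstract is word-level voting across the hypotheses of the teacher models. The sketch below is a toy illustration, not the paper's implementation: it assumes the hypotheses have already been aligned into equal-length word sequences (real ROVER builds this alignment with dynamic programming and can weight votes by confidence), with `*` marking an empty slot, and breaks ties by first occurrence.

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Toy ROVER-style combination: majority vote per aligned word slot.

    aligned_hyps: list of equal-length word lists, one per teacher
    hypothesis, with '*' marking an empty (deletion) slot. Ties fall to
    the word seen first, a simplification of ROVER's confidence weighting.
    """
    voted = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "*":  # drop slots where the majority voted for "no word"
            voted.append(word)
    return " ".join(voted)

# Three hypothetical teacher hypotheses for the same utterance:
combined = rover_vote([
    ["the", "cat", "sat"],
    ["the", "cat", "*"],
    ["a",   "cat", "sat"],
])
```

Here the first slot votes 2-to-1 for "the" and the last slot 2-to-1 for "sat", so `combined` is `"the cat sat"`, even though no single hypothesis is error-free; this is the effect the paper exploits to build a teacher stronger than any individual RNN-T or CTC model.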


doi: 10.21437/Interspeech.2021-637

Cite as: Doutre, T., Han, W., Chiu, C.-C., Pang, R., Siohan, O., Cao, L. (2021) Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models. Proc. Interspeech 2021, 1807-1811, doi: 10.21437/Interspeech.2021-637

@inproceedings{doutre21_interspeech,
  author={Thibault Doutre and Wei Han and Chung-Cheng Chiu and Ruoming Pang and Olivier Siohan and Liangliang Cao},
  title={{Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1807--1811},
  doi={10.21437/Interspeech.2021-637}
}