ISCA Archive Interspeech 2021

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models

Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe

Non-autoregressive (NAR) modeling has gained increasing attention in speech processing. Combined with recent state-of-the-art attention-based automatic speech recognition (ASR) architectures, NAR models can achieve a promising real-time factor (RTF) improvement with only a small degradation in accuracy compared to autoregressive (AR) models. However, recognition inference must wait for the completion of a full speech utterance, which limits its application in low-latency scenarios. To address this issue, we propose a novel end-to-end streaming NAR speech recognition system that combines blockwise attention with Mask-CTC, a NAR model based on connectionist temporal classification with mask-predict decoding. During inference, the input audio is separated into small blocks and processed in a blockwise streaming manner. To address insertion and deletion errors at the edges of each block's output, we apply an overlapping decoding strategy with a dynamic mapping trick that produces more coherent sentences. Experimental results show that the proposed method improves online ASR recognition in low-latency conditions compared to vanilla Mask-CTC. Moreover, it achieves a much faster inference speed than AR attention-based models. All of our code will be publicly available.
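The blockwise streaming scheme in the abstract, splitting the input into overlapping blocks and merging the per-block outputs, can be sketched as follows. This is a minimal illustration only: all names (`split_into_blocks`, `merge_overlapping`, the `block_size`/`overlap` parameters) are assumptions for exposition, and the naive overlap merge below stands in for the paper's more involved dynamic mapping trick.

```python
# Sketch of overlapping blockwise streaming inference (illustrative only;
# not the authors' implementation, which uses a dynamic mapping trick).

def split_into_blocks(frames, block_size, overlap):
    """Yield overlapping blocks of input frames for streaming decoding."""
    hop = block_size - overlap
    for start in range(0, max(len(frames) - overlap, 1), hop):
        yield frames[start:start + block_size]

def merge_overlapping(prev_tokens, new_tokens, overlap_tokens):
    """Naively drop tokens duplicated in the overlapping region.

    A placeholder for the paper's dynamic mapping, which aligns the
    overlapping hypotheses to suppress edge insertion/deletion errors.
    """
    if not prev_tokens:
        return list(new_tokens)
    return prev_tokens + list(new_tokens[overlap_tokens:])
```

Each block would be decoded by the Mask-CTC model as it arrives, so latency is bounded by the block size rather than the utterance length.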


doi: 10.21437/Interspeech.2021-1556

Cite as: Wang, T., Fujita, Y., Chang, X., Watanabe, S. (2021) Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models. Proc. Interspeech 2021, 3755-3759, doi: 10.21437/Interspeech.2021-1556

@inproceedings{wang21ba_interspeech,
  author={Tianzi Wang and Yuya Fujita and Xuankai Chang and Shinji Watanabe},
  title={{Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3755--3759},
  doi={10.21437/Interspeech.2021-1556}
}