ISCA Archive Interspeech 2021

Toward Streaming ASR with Non-Autoregressive Insertion-Based Model

Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi

Neural end-to-end (E2E) models have become a promising technique for realizing practical automatic speech recognition (ASR) systems. When building such a system, one important issue is segmenting the audio to handle streaming input or long recordings. After audio segmentation, an ASR model with a small real-time factor (RTF) is preferable because it lowers the latency of the system. Recently, E2E ASR based on non-autoregressive models has become a promising approach since it can decode an N-length token sequence in fewer than N iterations. We propose a system that concatenates audio segmentation and non-autoregressive ASR to achieve high accuracy and low RTF. The insertion-based model is used as the non-autoregressive ASR model. In addition, instead of concatenating separate models for segmentation and ASR, we introduce a new architecture that performs audio segmentation and non-autoregressive ASR with a single neural network. Experimental results on Japanese and English datasets show that the method achieves a reasonable trade-off between accuracy and RTF compared with baseline autoregressive Transformer and connectionist temporal classification models.
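
For readers unfamiliar with the metric, the real-time factor mentioned in the abstract is conventionally the ratio of processing time to audio duration, so values below 1 mean the recognizer runs faster than real time. The following is a minimal sketch of that ratio only; the function name and the example timings are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF = processing time / audio duration (conventional definition).

    Both arguments are hypothetical measurements supplied by the caller;
    nothing here reflects the timings reported in the paper.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


# Illustrative numbers only: decoding a 10 s segment in 2 s of compute.
rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")
```

A lower RTF on each segment shortens the delay between the end of a segment and the availability of its transcript, which is why the abstract pairs segmentation with a fast decoder.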


doi: 10.21437/Interspeech.2021-1131

Cite as: Fujita, Y., Wang, T., Watanabe, S., Omachi, M. (2021) Toward Streaming ASR with Non-Autoregressive Insertion-Based Model. Proc. Interspeech 2021, 3740-3744, doi: 10.21437/Interspeech.2021-1131

@inproceedings{fujita21b_interspeech,
  author={Yuya Fujita and Tianzi Wang and Shinji Watanabe and Motoi Omachi},
  title={{Toward Streaming ASR with Non-Autoregressive Insertion-Based Model}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3740--3744},
  doi={10.21437/Interspeech.2021-1131}
}