ISCA Archive Interspeech 2021

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei

In this paper, we present WeNet, an open-source speech recognition toolkit that implements a new two-pass approach named U2 to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. The main motivation of WeNet is to close the gap between research on E2E speech recognition models and their deployment. WeNet provides an efficient way to ship automatic speech recognition (ASR) applications in real-world scenarios, which is its main difference from, and advantage over, other open-source E2E speech recognition toolkits. We develop a hybrid connectionist temporal classification (CTC)/attention architecture with a transformer or conformer encoder and an attention decoder that rescores the CTC hypotheses. To support streaming and non-streaming recognition in a unified model, we use a dynamic chunk-based attention strategy that lets self-attention attend to right context of random length. Our experiments on the AISHELL-1 dataset show that our model achieves a 5.03% relative character error rate (CER) reduction in non-streaming ASR compared to a standard non-streaming transformer. After model quantization, our model achieves a reasonable real-time factor (RTF) and latency at runtime. The toolkit is publicly available.
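The chunk-based attention idea mentioned in the abstract can be illustrated with a small mask-building sketch. This is not WeNet's implementation: the function names `chunk_attention_mask` and `sample_chunk_size` are hypothetical, and the sketch assumes that each frame may attend to all earlier frames plus the frames up to the end of its own chunk, with the chunk size drawn at random during training.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Assumption (not taken from the paper text above): full left context,
    and right context limited to the end of the chunk containing frame i.
    """
    mask = []
    for i in range(num_frames):
        # Exclusive end index of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, rng=random):
    """Hypothetical training-time sampler: pick a chunk size in [1, num_frames].

    A chunk size equal to num_frames reproduces full (non-streaming) attention.
    """
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    # Print the mask for 5 frames and chunks of 2 frames (1 = visible).
    for row in chunk_attention_mask(5, 2):
        print("".join("1" if visible else "0" for visible in row))
```

With a chunk size equal to the utterance length, every frame sees the whole utterance, which corresponds to the non-streaming case; smaller chunk sizes bound how far ahead a frame can look, which is what makes streaming decoding possible under this assumption.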


doi: 10.21437/Interspeech.2021-1983

Cite as: Yao, Z., Wu, D., Wang, X., Zhang, B., Yu, F., Yang, C., Peng, Z., Chen, X., Xie, L., Lei, X. (2021) WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. Proc. Interspeech 2021, 4054-4058, doi: 10.21437/Interspeech.2021-1983

@inproceedings{yao21_interspeech,
  author={Zhuoyuan Yao and Di Wu and Xiong Wang and Binbin Zhang and Fan Yu and Chao Yang and Zhendong Peng and Xiaoyu Chen and Lei Xie and Xin Lei},
  title={{WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={4054--4058},
  doi={10.21437/Interspeech.2021-1983}
}