ISCA Archive Interspeech 2020

ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context

Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu

Convolutional neural networks (CNNs) have shown promising results for end-to-end speech recognition, albeit still behind RNN/transformer-based models in performance. In this paper, we study how to bridge this gap and go beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet. ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules. In addition, we propose a simple method that scales the widths of ContextNet, achieving a good trade-off between computation and accuracy.

We demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves a word error rate (WER) of 2.1%/4.6% without an external language model (LM), 1.9%/4.1% with an LM, and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets. This compares to the best previously published system at 2.0%/4.6% with an LM and 3.9%/11.3% with 20M parameters. The superiority of the proposed ContextNet model is further verified on a much larger internal dataset.
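As a concrete illustration of the two ideas in the abstract, below is a minimal PyTorch sketch of a 1-D convolution block with a squeeze-and-excitation gate and a width multiplier. This is not the paper's implementation: the block name SEConvBlock, the kernel size, the bottleneck ratio se_ratio, and the example value of alpha are illustrative assumptions.

# Minimal sketch of a squeeze-and-excitation (SE) 1-D conv block with
# width scaling, as described in the abstract. Layer sizes and names
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class SEConvBlock(nn.Module):
    """1-D convolution followed by channel reweighting from global context."""

    def __init__(self, in_channels: int, base_channels: int,
                 alpha: float = 1.0, se_ratio: int = 8):
        super().__init__()
        # Width scaling: multiply the base channel width by a global
        # factor alpha to trade computation against accuracy.
        out_channels = int(alpha * base_channels)
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size=5, padding=2)
        # Squeeze: pool over time to one vector per utterance;
        # excite: a small bottleneck MLP produces per-channel gates.
        self.se = nn.Sequential(
            nn.Linear(out_channels, out_channels // se_ratio),
            nn.ReLU(),
            nn.Linear(out_channels // se_ratio, out_channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        y = self.conv(x)
        context = y.mean(dim=2)              # global average pool over time
        gates = self.se(context).unsqueeze(-1)
        return y * gates                     # broadcast gates over all frames

# Usage: 80-dim log-mel features, 100 frames, width multiplier 0.5.
feats = torch.randn(4, 80, 100)
block = SEConvBlock(in_channels=80, base_channels=256, alpha=0.5)
print(block(feats).shape)  # torch.Size([4, 128, 100])

Because the gates are computed from an average over the whole utterance, every convolution output is modulated by global context rather than only its local receptive field.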


doi: 10.21437/Interspeech.2020-2059

Cite as: Han, W., Zhang, Z., Zhang, Y., Yu, J., Chiu, C.-C., Qin, J., Gulati, A., Pang, R., Wu, Y. (2020) ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. Proc. Interspeech 2020, 3610-3614, doi: 10.21437/Interspeech.2020-2059

@inproceedings{han20_interspeech,
  author={Wei Han and Zhengdong Zhang and Yu Zhang and Jiahui Yu and Chung-Cheng Chiu and James Qin and Anmol Gulati and Ruoming Pang and Yonghui Wu},
  title={{ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context}},
  year={2020},
  booktitle={Proc. Interspeech 2020},
  pages={3610--3614},
  doi={10.21437/Interspeech.2020-2059}
}