Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan


We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During beam search, we combine the CTC predictions, the attention-based decoder predictions, and a separately trained LSTM language model. We achieve a 5–10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model outperforms traditional hybrid ASR systems.
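The decoding strategy described above ranks each partial hypothesis by a weighted combination of the three model scores. A minimal sketch of that scoring rule, assuming log-domain scores and illustrative weight names (`ctc_weight`, `lm_weight` are not taken from the paper):

```python
import math

def joint_score(logp_ctc: float, logp_att: float, logp_lm: float,
                ctc_weight: float = 0.3, lm_weight: float = 0.5) -> float:
    """Hypothetical joint beam-search score: an interpolation of the CTC
    and attention decoder log probabilities, plus a weighted RNN-LM term.
    """
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)

# During beam search, each candidate character extension would be scored
# this way and the top-k hypotheses kept at every step.
hypotheses = [
    ("hello", joint_score(math.log(0.4), math.log(0.5), math.log(0.3))),
    ("hallo", joint_score(math.log(0.2), math.log(0.1), math.log(0.05))),
]
best = max(hypotheses, key=lambda h: h[1])
```

The interpolation weight trades off the monotonic-alignment bias of CTC against the more flexible attention decoder; the exact weight values used in the experiments are reported in the paper itself.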


DOI: 10.21437/Interspeech.2017-1296

Cite as: Hori, T., Watanabe, S., Zhang, Y., Chan, W. (2017) Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM. Proc. Interspeech 2017, 949-953, DOI: 10.21437/Interspeech.2017-1296.


@inproceedings{Hori2017,
  author={Takaaki Hori and Shinji Watanabe and Yu Zhang and William Chan},
  title={Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={949--953},
  doi={10.21437/Interspeech.2017-1296},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1296}
}