Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition

Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Ta Li, Yonghong Yan


The hybrid CTC/attention end-to-end automatic speech recognition (ASR) architecture combines a CTC-based ASR system and an attention-based ASR system into a single neural network. Although the hybrid CTC/attention ASR system takes advantage of both CTC and attention architectures in training and decoding, it remains difficult to use for streaming speech recognition because of its attention mechanism, CTC prefix probability and bidirectional encoder. In this paper, we propose a stable monotonic chunkwise attention (sMoChA) to stream its attention branch and a truncated CTC prefix probability (T-CTC) to stream its CTC branch. On the acoustic model side, we utilize the latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream its encoder. On the joint CTC/attention decoding side, we propose the dynamic waiting joint decoding (DWJD) algorithm to collect the decoding hypotheses from the CTC and attention branches. Through the combination of the above methods, we stream the hybrid CTC/attention ASR system with little word error rate degradation.
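In joint CTC/attention decoding, each hypothesis is typically scored by interpolating the log-probabilities of the two branches. The minimal sketch below illustrates that interpolation; the weight value `lambda_weight` and the toy hypothesis scores are assumptions for illustration, not figures from the paper.

```python
import math

def joint_score(log_p_ctc, log_p_att, lambda_weight=0.3):
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    lambda_weight is an assumed hyperparameter controlling the CTC share
    of the joint score (not a value reported in the abstract).
    """
    return lambda_weight * log_p_ctc + (1.0 - lambda_weight) * log_p_att

# Toy example: rank two candidate hypotheses by their joint score.
# Each entry maps a hypothesis to (CTC log-prob, attention log-prob).
hyps = {
    "hello world": (-4.0, -3.0),
    "hollow word": (-6.0, -3.5),
}
scored = {h: joint_score(c, a) for h, (c, a) in hyps.items()}
best = max(scored, key=scored.get)
```

In the streaming setting described above, the CTC term would come from the truncated T-CTC prefix probability, and DWJD would decide when each branch's scores are available; this sketch only shows the final score combination.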


DOI: 10.21437/Interspeech.2019-2018

Cite as: Miao, H., Cheng, G., Zhang, P., Li, T., Yan, Y. (2019) Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition. Proc. Interspeech 2019, 2623-2627, DOI: 10.21437/Interspeech.2019-2018.


@inproceedings{Miao2019,
  author={Haoran Miao and Gaofeng Cheng and Pengyuan Zhang and Ta Li and Yonghong Yan},
  title={{Online Hybrid CTC/Attention Architecture for End-to-End Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2623--2627},
  doi={10.21437/Interspeech.2019-2018},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2018}
}