Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition

Dhananjaya Gowda, Abhinav Garg, Kwangyoun Kim, Mehul Kumar, Chanwoo Kim


In this paper, we present a new hierarchical character-to-byte-pair encoding (C2B) end-to-end neural network architecture for improving the performance of attention-based encoder-decoder ASR models. We explore different strategies for building hierarchical C2B models, such as building the individual blocks one at a time, as well as training the entire model as a monolith in a single step. We show that a C2B model trained simultaneously with four losses, two for character sequences and two for BPE sequences, helps regularize the learning of both the character and the BPE sequences. The proposed multi-task multi-resolution hierarchical architecture improves the WER of a small-footprint bidirectional full-attention E2E model on the 960-hour LibriSpeech corpus by around 15% relative, and is comparable to the state-of-the-art performance of a model almost three times larger on the same dataset.
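To make the four-loss multi-task objective concrete, the sketch below shows one plausible way to combine two character-level and two BPE-level cross-entropy losses into a single training objective. This is a minimal illustration under assumptions, not the authors' implementation: the function name, tensor shapes, and the equal loss weights are all hypothetical, and the paper does not specify how the four losses are weighted.

import torch
import torch.nn.functional as F

def multitask_c2b_loss(char_logits_a, char_logits_b,
                       bpe_logits_a, bpe_logits_b,
                       char_targets, bpe_targets,
                       weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine two character losses and two BPE losses (sketch).

    Assumed shapes: each logits tensor is (batch, seq_len, vocab);
    each targets tensor is (batch, seq_len) with -100 marking padding.
    Equal weights are an assumption, not the paper's setting.
    """
    def ce(logits, targets):
        # F.cross_entropy expects (batch, vocab, seq_len) for sequence inputs.
        return F.cross_entropy(logits.transpose(1, 2), targets,
                               ignore_index=-100)

    losses = (
        ce(char_logits_a, char_targets),  # character loss, first decoder output
        ce(char_logits_b, char_targets),  # auxiliary character loss
        ce(bpe_logits_a, bpe_targets),    # BPE loss, cross-attention decoder output
        ce(bpe_logits_b, bpe_targets),    # auxiliary BPE loss
    )
    return sum(w * l for w, l in zip(weights, losses))

Training on this summed objective lets the lower-resolution BPE branch and the higher-resolution character branch regularize each other, which is the intuition the abstract describes.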


DOI: 10.21437/Interspeech.2019-3216

Cite as: Gowda, D., Garg, A., Kim, K., Kumar, M., Kim, C. (2019) Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition. Proc. Interspeech 2019, 2783-2787, DOI: 10.21437/Interspeech.2019-3216.


@inproceedings{Gowda2019,
  author={Dhananjaya Gowda and Abhinav Garg and Kwangyoun Kim and Mehul Kumar and Chanwoo Kim},
  title={{Multi-Task Multi-Resolution Char-to-BPE Cross-Attention Decoder for End-to-End Speech Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2783--2787},
  doi={10.21437/Interspeech.2019-3216},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3216}
}