ISCA Archive Interspeech 2021

Multitask Training with Text Data for End-to-End Speech Recognition

Peidong Wang, Tara N. Sainath, Ron J. Weiss

We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder of a listen, attend, and spell model by training it in a multitask fashion on both audio-text pairs and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, yields an 11% relative performance improvement over the baseline and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the full 960-hour LibriSpeech training set. Analyses of different error types and sample output sentences demonstrate that the proposed method can incorporate language-level information, suggesting its effectiveness in real-world applications.
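The core idea admits a short illustration. The sketch below is not the authors' implementation; it is a minimal PyTorch toy, with an assumed decoder architecture, assumed shapes, and an illustrative mixing weight lam. It shows the multitask objective described above: one decoder computes a standard cross-entropy loss on paired audio-text batches and a language-model-style cross-entropy on text-only batches (acoustic context zeroed out), and the two losses are summed.

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Toy, attention-free stand-in for the LAS decoder (names assumed)."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, vocab)

    def forward(self, tokens, context):
        # context: per-step acoustic context vectors; zeros for text-only
        # batches, mimicking the decoder running without audio evidence.
        h, _ = self.rnn(self.embed(tokens))
        return self.proj(torch.cat([h, context], dim=-1))

vocab, dim, lam = 100, 64, 0.3            # lam: text-loss weight (illustrative)
dec = Decoder(vocab, dim)
ce = nn.CrossEntropyLoss()
opt = torch.optim.Adam(dec.parameters(), lr=1e-3)

# Dummy batches: paired (audio context + transcript) and text-only.
paired_in = torch.randint(vocab, (8, 12))
paired_tgt = torch.randint(vocab, (8, 12))
audio_ctx = torch.randn(8, 12, dim)       # would come from listener/attention
text_in = torch.randint(vocab, (8, 12))
text_tgt = torch.randint(vocab, (8, 12))

# Multitask step: ASR loss on paired data + LM-style loss on text-only data.
asr_logits = dec(paired_in, audio_ctx)
lm_logits = dec(text_in, torch.zeros(8, 12, dim))
loss = ce(asr_logits.reshape(-1, vocab), paired_tgt.reshape(-1)) \
     + lam * ce(lm_logits.reshape(-1, vocab), text_tgt.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()

Because the text-only loss updates the same decoder parameters used for recognition, the decoder absorbs language-level statistics directly, which is how the method can approach shallow fusion without a separate language model at inference time.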


doi: 10.21437/Interspeech.2021-683

Cite as: Wang, P., Sainath, T.N., Weiss, R.J. (2021) Multitask Training with Text Data for End-to-End Speech Recognition. Proc. Interspeech 2021, 2566-2570, doi: 10.21437/Interspeech.2021-683

@inproceedings{wang21t_interspeech,
  author={Peidong Wang and Tara N. Sainath and Ron J. Weiss},
  title={{Multitask Training with Text Data for End-to-End Speech Recognition}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2566--2570},
  doi={10.21437/Interspeech.2021-683}
}