Advancing Sequence-to-Sequence Based Speech Recognition

Zoltán Tüske, Kartik Audhkhasi, George Saon


This paper presents our effort to improve state-of-the-art speech recognition results using attention-based neural network approaches. Our experimental focus was LibriSpeech, a well-known, publicly available, large speech corpus, but the methodologies are clearly applicable to other tasks. After systematically applying standard techniques (sophisticated data augmentation, various dropout schemes, scheduled sampling, warm restart) and optimizing the search configuration, our model achieves 4.0% and 11.7% word error rate (WER) on the test-clean and test-other sets without any external language model. A powerful recurrent language model further reduces the error rates to 2.7% and 8.2%. Thus, we not only report the lowest sequence-to-sequence numbers on this task to date, but our single system also challenges the best result known in the literature, namely a hybrid model with recurrent language model rescoring. A simple ROVER combination of several of our attention-based systems achieves 2.5% and 7.3% WER on the clean and other test sets.
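To illustrate one of the techniques named above, the following is a minimal scheduled-sampling sketch for a sequence-to-sequence decoder, written in PyTorch. It is not the authors' implementation: the class name, the hidden size, and the assumption that position 0 of each target sequence holds the start-of-sequence token are illustrative only.

    # Hypothetical sketch of scheduled sampling in a seq2seq decoder (PyTorch).
    # All names and hyperparameters are illustrative, not taken from the paper.
    import torch
    import torch.nn as nn


    class Decoder(nn.Module):
        def __init__(self, vocab_size: int, hidden: int = 512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.cell = nn.LSTMCell(hidden, hidden)
            self.proj = nn.Linear(hidden, vocab_size)

        def forward(self, targets: torch.Tensor, sampling_prob: float) -> torch.Tensor:
            """targets: (batch, seq_len) token ids, position 0 assumed to be <sos>.
            Returns logits of shape (batch, seq_len - 1, vocab)."""
            batch, seq_len = targets.shape
            device = targets.device
            h = torch.zeros(batch, self.cell.hidden_size, device=device)
            c = torch.zeros(batch, self.cell.hidden_size, device=device)
            prev_tokens = targets[:, 0]
            logits = []
            for t in range(1, seq_len):
                h, c = self.cell(self.embed(prev_tokens), (h, c))
                step_logits = self.proj(h)
                logits.append(step_logits)
                # Scheduled sampling: with probability `sampling_prob`, condition the
                # next step on the model's own prediction instead of the ground truth.
                use_model = torch.rand(batch, device=device) < sampling_prob
                prev_tokens = torch.where(use_model,
                                          step_logits.argmax(dim=-1),
                                          targets[:, t])
            return torch.stack(logits, dim=1)

In such a setup, sampling_prob would typically be ramped from 0 toward a small constant over the course of training, so that the decoder is gradually exposed to its own (possibly erroneous) predictions and the mismatch between teacher-forced training and free-running decoding is reduced.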


 DOI: 10.21437/Interspeech.2019-3018

Cite as: Tüske, Z., Audhkhasi, K., Saon, G. (2019) Advancing Sequence-to-Sequence Based Speech Recognition. Proc. Interspeech 2019, 3780-3784, DOI: 10.21437/Interspeech.2019-3018.


@inproceedings{Tüske2019,
  author={Zoltán Tüske and Kartik Audhkhasi and George Saon},
  title={{Advancing Sequence-to-Sequence Based Speech Recognition}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={3780--3784},
  doi={10.21437/Interspeech.2019-3018},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3018}
}