Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Hagen Soltau, Hank Liao, Haşim Sak


We present results showing that it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model an output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which alleviates the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end, all-neural speech recognition model without the traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need for decoding. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
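The CTC loss used here scores a target sequence by summing the probabilities of all frame-level alignments (with optional blanks and label repetitions) that collapse to that sequence. As a minimal illustration, not the authors' implementation, the sketch below computes the CTC log-likelihood of one target with the standard forward algorithm in plain Python; the function name and interface are hypothetical:

```python
import math

NEG_INF = float("-inf")

def _logadd(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def ctc_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm.

    log_probs: list of T frames, each a list of per-label log-probabilities.
    target: label sequence (list of ints, no blanks).
    Returns log P(target | log_probs), summed over all valid alignments.
    """
    # Extended label sequence with blanks interleaved: ^ t1 ^ t2 ... ^
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S, T = len(ext), len(log_probs)

    # Initialization: alignments may start with a blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same symbol
            if s > 0:
                a = _logadd(a, alpha[s - 1])  # advance by one symbol
            # Skip the blank between two *different* labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = _logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or the trailing blank.
    tail = alpha[S - 2] if S > 1 else NEG_INF
    return _logadd(alpha[S - 1], tail)
```

For example, with two frames, a two-symbol alphabet {blank, a}, uniform probabilities, and target "a", the three alignments "aa", "a^", and "^a" each contribute 0.25, so the total likelihood is 0.75.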


DOI: 10.21437/Interspeech.2017-1566

Cite as: Soltau, H., Liao, H., Sak, H. (2017) Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition. Proc. Interspeech 2017, 3707-3711, DOI: 10.21437/Interspeech.2017-1566.


@inproceedings{Soltau2017,
  author={Hagen Soltau and Hank Liao and Haşim Sak},
  title={Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3707--3711},
  doi={10.21437/Interspeech.2017-1566},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1566}
}