We present results showing that it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which alleviates the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model, without the traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need for decoding. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
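A minimal sketch (not the authors' implementation) of the approach the abstract describes: a deep bi-directional LSTM acoustic model trained with CTC loss over a word-level output vocabulary, illustrated here with PyTorch. The feature dimension, layer sizes, and batch shapes below are placeholder assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class WordCTCModel(nn.Module):
    """Deep bi-directional LSTM emitting per-frame word posteriors for CTC."""
    def __init__(self, num_feats=80, hidden=512, layers=5, vocab_size=100_000):
        super().__init__()
        # Stack of bi-directional LSTM layers over acoustic feature frames.
        self.lstm = nn.LSTM(num_feats, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        # Projection to word logits plus the CTC blank symbol (index 0).
        self.proj = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, feats):
        out, _ = self.lstm(feats)   # (batch, time, 2 * hidden)
        return self.proj(out)       # (batch, time, vocab_size + 1)

model = WordCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# Dummy batch: 2 utterances of 200 frames, word targets of lengths 7 and 5.
feats = torch.randn(2, 200, 80)
log_probs = model(feats).log_softmax(dim=-1).transpose(0, 1)  # (time, batch, classes)
targets = torch.randint(1, 100_001, (12,))                    # concatenated word ids
input_lengths = torch.tensor([200, 200])
target_lengths = torch.tensor([7, 5])
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Because the output units are whole words, recognition in this setup can be as simple as a per-frame argmax over the posteriors followed by collapsing repeats and removing blanks, with no lexicon or language-model decoding step.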
Cite as: Soltau, H., Liao, H., Sak, H. (2017) Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition. Proc. Interspeech 2017, 3707-3711, doi: 10.21437/Interspeech.2017-1566
@inproceedings{soltau17_interspeech,
  author={Hagen Soltau and Hank Liao and Haşim Sak},
  title={{Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3707--3711},
  doi={10.21437/Interspeech.2017-1566}
}