This article discusses strategies for end-to-end training of state-of-the-art acoustic models for Large Vocabulary Continuous Speech Recognition (LVCSR), with the goal of leveraging TensorFlow components to make efficient use of large-scale training sets, large model sizes, and high-speed computation units such as Graphics Processing Units (GPUs). Benchmarks are presented that evaluate the efficiency of different approaches to batching of training data, unrolling of recurrent acoustic models, and device placement of TensorFlow variables and operations. An overall training architecture developed in light of those findings is then described. The approach makes it possible to take advantage of both data parallelism and high-speed computation on GPUs for state-of-the-art sequence training of acoustic models. The effectiveness of the design is evaluated for different training schemes and model sizes on a 15,000-hour Voice Search task.
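To make the batching, unrolling, and device-placement choices mentioned above concrete, the following is a minimal, hypothetical sketch in TensorFlow 1.x (contemporary with the paper), not code from the paper itself: it unrolls an LSTM acoustic model over a batch of variable-length utterances and pins both the model variables and the unrolled recurrence to a single GPU. All sizes, placeholder names, and the output-state inventory are illustrative assumptions.

```python
import tensorflow as tf

BATCH_SIZE = 32     # illustrative values only
MAX_FRAMES = 200    # unrolling length in frames
NUM_FEATS = 80      # e.g. log-mel filterbank dimension
NUM_UNITS = 512     # LSTM cells per layer
NUM_STATES = 8192   # assumed context-dependent state inventory

# Batched acoustic features: [batch, time, feature] plus per-utterance lengths.
features = tf.placeholder(tf.float32, [BATCH_SIZE, MAX_FRAMES, NUM_FEATS])
seq_lens = tf.placeholder(tf.int32, [BATCH_SIZE])

with tf.device("/gpu:0"):
    # dynamic_rnn unrolls the recurrence over the time dimension at run time;
    # variables created inside this scope are also placed on the GPU.
    cell = tf.nn.rnn_cell.LSTMCell(NUM_UNITS)
    outputs, _ = tf.nn.dynamic_rnn(
        cell, features, sequence_length=seq_lens, dtype=tf.float32)
    # Per-frame pre-softmax scores over the acoustic state inventory.
    logits = tf.layers.dense(outputs, NUM_STATES)
```

In a data-parallel setup of the kind the paper evaluates, the same graph fragment would be replicated across workers, with variable placement (e.g. on parameter servers versus the local GPU) being one of the benchmarked design choices.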
Cite as: Variani, E., Bagby, T., McDermott, E., Bacchiani, M. (2017) End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow. Proc. Interspeech 2017, 1641-1645, doi: 10.21437/Interspeech.2017-1284
@inproceedings{variani17_interspeech,
  author={Ehsan Variani and Tom Bagby and Erik McDermott and Michiel Bacchiani},
  title={{End-to-End Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition with TensorFlow}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1641--1645},
  doi={10.21437/Interspeech.2017-1284}
}