Robust and accurate speech recognition systems can only be realized with adequately trained acoustic models. For common languages, state-of-the-art systems are now trained on thousands of hours of speech data. Even with a large cluster of machines the entire training process can take many weeks. To overcome this development bottleneck we propose a new framework for rapid training of acoustic models using highly parallel graphics processing units (GPUs). In this paper we focus on Viterbi training and describe the optimizations required for effective throughput on GPU processors. Using a single NVIDIA GTX580 GPU our proposed approach is shown to be 51× faster than a sequential CPU implementation, enabling a moderately sized acoustic model to be trained on 1000 hours of speech data in just over 9 hours. Moreover, we show that our implementation on a two-GPU system can perform 67% faster than a standard parallel reference implementation on a high-end 32-core Xeon server. Our GPU-based training platform empowers research groups to rapidly evaluate new ideas and build accurate and robust acoustic models on very large training corpora.
Bibliographic reference. Buthpitiya, Senaka / Lane, Ian / Chong, Jike (2011): "Rapid training of acoustic models using graphics processing unit", In INTERSPEECH-2011, 793-796.