12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Rapid Training of Acoustic Models Using Graphics Processing Unit

Senaka Buthpitiya, Ian Lane, Jike Chong

Carnegie Mellon University, USA

Robust and accurate speech recognition systems can only be realized with adequately trained acoustic models. For common languages, state-of-the-art systems are now trained on thousands of hours of speech data. Even with a large cluster of machines the entire training process can take many weeks. To overcome this development bottleneck we propose a new framework for rapid training of acoustic models using highly parallel graphics processing units (GPUs). In this paper we focus on Viterbi training and describe the optimizations required for effective throughput on GPU processors. Using a single NVIDIA GTX580 GPU our proposed approach is shown to be 51× faster than a sequential CPU implementation, enabling a moderately sized acoustic model to be trained on 1000 hours of speech data in just over 9 hours. Moreover, we show that our implementation on a two-GPU system can perform 67% faster than a standard parallel reference implementation on a high-end 32-core Xeon server. Our GPU-based training platform empowers research groups to rapidly evaluate new ideas and build accurate and robust acoustic models on very large training corpora.


Bibliographic reference.  Buthpitiya, Senaka / Lane, Ian / Chong, Jike (2011): "Rapid training of acoustic models using graphics processing unit", In INTERSPEECH-2011, 793-796.