The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM,
is a recently proposed acoustic-modeling technique
for HMM-based speech recognition that can greatly outperform
conventional Gaussian-mixture based HMMs. For example,
a CD-DNN-HMM trained on the 2000h Fisher corpus
achieves 14.4% word error rate on the Hub5'00-FSH speaker-independent
phone-call transcription task, compared to 19.6%
obtained by a state-of-the-art, conventional discriminatively
trained GMM-based HMM.
That CD-DNN-HMM, however, took 59 days to train on a modern GPGPU. The immense computational cost of minibatch-based back-propagation (BP) training is a major roadblock. Unlike the familiar Baum-Welch training for conventional HMMs, BP cannot be efficiently parallelized across data.
In this paper we show that the pipelined approximation to BP, which parallelizes computation with respect to layers, is an efficient way of utilizing multiple GPGPU cards in a single server. Using 2 and 4 GPGPUs, we achieve end-to-end speed-ups of 1.9 and 3.3 times, at parallelization efficiencies of 0.95 and 0.82, respectively, with no loss of recognition accuracy.
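As a quick check of these figures, parallelization efficiency can be read as the speed-up divided by the number of GPGPUs. The sketch below assumes that definition, which the abstract does not state explicitly, and the function name is hypothetical. Because the speed-ups are quoted to two significant digits, the 4-GPGPU value computed this way is 0.825, which is consistent with the reported 0.82 only up to rounding.

```python
def parallel_efficiency(speedup: float, num_devices: int) -> float:
    """Assumed definition: efficiency = speed-up / number of devices."""
    if num_devices <= 0:
        raise ValueError("num_devices must be positive")
    return speedup / num_devices


# Speed-ups quoted in the abstract, keyed by number of GPGPUs.
reported_speedups = {2: 1.9, 4: 3.3}

for gpus, speedup in reported_speedups.items():
    eff = parallel_efficiency(speedup, gpus)
    print(f"{gpus} GPGPUs: speed-up {speedup:.1f}x, efficiency {eff:.3f}")
```

An efficiency of 1.0 would correspond to ideal linear scaling, so the gap to 1.0 gives a rough sense of the overhead incurred as more cards are added.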
Index Terms: speech recognition, deep neural networks, parallelization, GPGPU
Bibliographic reference. Chen, Xie / Eversole, Adam / Li, Gang / Yu, Dong / Seide, Frank (2012): "Pipelined back-propagation for context-dependent deep neural networks", In INTERSPEECH-2012, 26-29.