We conducted a comparative analytic study of the context-dependent Gaussian mixture hidden Markov model (CD-GMM-HMM) and the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) with respect to phone discrimination and robustness. We found that the DNN significantly improves phone recognition performance for every phoneme, with a relative phone error rate reduction (PERR) of 15.6% to 39.8%. It is particularly good at discriminating certain consonants that are hard for the GMM. On the robustness side, the DNN outperforms the GMM at all SNR levels, across different devices, and at all speaking rates, with nearly uniform improvement. However, the performance gap across different SNR levels, distinct channels, and varied speaking rates remains large. For example, with the CD-DNN-HMM we observed a 1 to 2% performance degradation per 1 dB drop in SNR, a 20 to 25% performance gap between the best- and worst-performing devices, and a 15 to 30% relative word error rate increase when the speaking rate speeds up or slows down by 30% from the sweet spot. We therefore conclude that robustness remains a major challenge for deep learning acoustic models. Speech enhancement, channel normalization, and speaking rate compensation are important research areas for further improving DNN model accuracy.
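The relative figures quoted above (relative phone error rate reduction, relative word error rate increase) are conventionally computed as the change in error rate divided by the baseline error rate. The following is a minimal sketch of that arithmetic, not the authors' evaluation code; the function name and the input error rates are hypothetical and chosen only for illustration.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the relative error rate reduction as a percentage of the baseline.

    A positive value means the new system makes fewer errors than the baseline;
    a negative value corresponds to a relative error rate increase.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent (made up, not taken from the paper).
gmm_per = 30.0  # assumed baseline phone error rate
dnn_per = 24.0  # assumed phone error rate of the improved model

reduction = relative_error_reduction(gmm_per, dnn_per)
print(f"{reduction:.1f}% relative reduction")
```

Under this convention, swapping the roles of the two arguments' magnitudes (a new error rate above the baseline) yields a negative number, which is how a relative error rate increase such as the speaking-rate effect would appear.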
Bibliographic reference. Huang, Yan / Yu, Dong / Liu, Chaojun / Gong, Yifan (2014): "A comparative analytic study on the Gaussian mixture and context dependent deep neural network hidden Markov models", In INTERSPEECH-2014, 1895-1899.