International Workshop on Spoken Language Translation (IWSLT) 2012

Hong Kong
December 6-7, 2012

Who Can Understand Your Speech Better - Deep Neural Network or Gaussian Mixture Model?

Dong Yu

Microsoft Research, Redmond, WA, USA

Recently we have shown that the context-dependent deep neural network (DNN) hidden Markov model (CD-DNN-HMM) can perform surprisingly well for large vocabulary speech recognition (LVSR), as demonstrated on several benchmark tasks. Since then, much work has been done to understand its potential and to further advance the state of the art. In this talk I will share some of these insights and introduce some of the recent progress we have made.
    In the talk, I will first briefly describe the CD-DNN-HMM and offer some insights into why DNNs can do better than shallow neural networks and Gaussian mixture models. My discussion will be based on the observation that a DNN can be viewed as the joint model of a complicated feature extractor and a log-linear model. I will then describe how some of the obstacles to adopting CD-DNN-HMMs, such as training speed, decoding speed, sequence-level training, and adaptation, can be removed thanks to recent advances. After that, I will show ways to further improve the DNN structure to achieve better recognition accuracy and to support new scenarios. I will conclude the talk by arguing that DNNs not only perform better but are also simpler than GMMs.
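The view of a DNN as a feature extractor composed with a log-linear model can be sketched as follows: the hidden layers nonlinearly transform the input frame into learned features, and the final softmax layer is exactly a log-linear classifier over those features. This is a minimal illustrative sketch with hypothetical layer sizes and random weights, not the system described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: a 39-dim acoustic frame, two hidden layers, 10 output classes.
sizes = [39, 128, 128, 10]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def dnn_posteriors(x):
    """Hidden layers act as a learned feature extractor; the final
    softmax layer is a log-linear model over the extracted features."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                    # nonlinear feature extraction
    return softmax(h @ weights[-1] + biases[-1])  # log-linear classification

x = rng.standard_normal(39)     # one hypothetical acoustic frame
p = dnn_posteriors(x)           # posterior over the 10 classes; sums to 1
```

In the hybrid CD-DNN-HMM, these posteriors (over senones, i.e. tied context-dependent HMM states) are divided by state priors to obtain scaled likelihoods for HMM decoding.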


Bibliographic reference.  Yu, Dong (2012): "Who can understand your speech better - deep neural network or Gaussian mixture model?", In IWSLT-2012, 9.