International Workshop on Spoken Language Translation (IWSLT) 2012
Recently we have shown that the context-dependent deep neural network
(DNN) hidden Markov model (CD-DNN-HMM) can perform surprisingly well for
large vocabulary speech recognition (LVSR), as demonstrated on several
benchmark tasks. Since then, much work has been done to understand its potential
and to further advance the state of the art. In this talk I will share some of these insights
and describe some of the recent progress we have made.
In the talk, I will first briefly describe the CD-DNN-HMM and offer some insights into why DNNs can do better than shallow neural networks and Gaussian mixture models. My discussion will be based on the observation that a DNN can be considered a joint model of a complicated feature extractor and a log-linear model. I will then describe how some of the obstacles to adopting CD-DNN-HMMs, such as training speed, decoding speed, sequence-level training, and adaptation, can be removed thanks to recent advances. After that, I will show ways to further improve DNN structures to achieve better recognition accuracy and to support new scenarios. I will conclude the talk by arguing that DNNs not only perform better but are also simpler than GMMs.
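The "feature extractor plus log-linear model" view mentioned above can be sketched in a few lines: the hidden layers transform the raw acoustic input into a learned representation, and the softmax output layer is exactly a log-linear (multinomial logistic) classifier over that representation. The sketch below is illustrative only — the layer sizes, the sigmoid nonlinearity, and the random weights are assumptions for demonstration, not the system described in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: a 39-dim acoustic feature vector, two hidden
# layers, and 10 output classes (real CD-DNN-HMMs use thousands of
# senone targets; 10 is chosen here only to keep the example small).
dims = [39, 128, 128, 10]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

def dnn_posterior(x):
    # The hidden layers act as a learned, nonlinear feature extractor.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    # The output layer is a plain log-linear (softmax) model applied
    # to the extracted features h.
    return softmax(h @ weights[-1] + biases[-1])

x = rng.standard_normal(39)     # one (random) input frame
p = dnn_posterior(x)            # posterior distribution over classes
```

Under this view, training the DNN jointly optimizes the feature extractor and the log-linear classifier, which is one intuition for why deep models can outperform a fixed-feature GMM pipeline.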
Full Paper Presentation
Bibliographic reference. Yu, Dong (2012): "Who can understand your speech better - deep neural network or Gaussian mixture model?", In IWSLT-2012, 9.