International Workshop on Spoken Language Translation (IWSLT) 2012, Hong Kong
Recently we have shown that the context-dependent deep neural network
(DNN) hidden Markov model (CD-DNN-HMM) can do surprisingly well for
large vocabulary speech recognition (LVSR), as demonstrated on several
benchmark tasks. Since then, much work has been done to understand its potential
and to further advance the state of the art. In this talk I will share some of these insights
and introduce some of the recent progress we have made.
In the talk, I will first briefly describe the CD-DNN-HMM and
offer some insights into why
DNNs can outperform shallow neural networks and Gaussian mixture models (GMMs).
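For readers less familiar with the hybrid architecture, the following is a standard formulation (in notation of my own choosing; the abstract itself does not spell it out): the DNN replaces the GMM as the emission model of the HMM by estimating posteriors over tied context-dependent states (senones), which are converted to scaled likelihoods for decoding,

$$\bar{p}(x_t \mid s) = \frac{P(s \mid x_t)}{P(s)} \propto p(x_t \mid s),$$

where $P(s \mid x_t)$ is the DNN softmax output for senone $s$ given acoustic frame $x_t$, and the prior $P(s)$ is estimated from the training alignment.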
My discussion will be based on the observation that a DNN can be viewed as the joint model of
a complicated feature extractor and a log-linear model, as formalized in the sketch below.
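Concretely, in a minimal sketch using my own notation: the hidden layers act as a learned feature extractor, and the softmax output layer is a log-linear model on the top hidden representation,

$$h^{0} = x_t, \qquad h^{\ell} = \sigma\!\left(W^{\ell} h^{\ell-1} + b^{\ell}\right), \quad \ell = 1, \dots, L,$$

$$P(s \mid x_t) = \frac{\exp\!\left(w_s^{\top} h^{L} + c_s\right)}{\sum_{s'} \exp\!\left(w_{s'}^{\top} h^{L} + c_{s'}\right)},$$

so that training all layers jointly optimizes the features $h^{L}$ together with the log-linear classifier, one intuition for why deep models can outperform shallow ones built on a fixed feature pipeline.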
I will then describe how some of the obstacles to adopting CD-DNN-HMMs, such as training speed,
decoding speed, sequence-level training, and adaptation, can be removed thanks to recent advances.
After that, I will show ways to further improve DNN structures to achieve
better recognition accuracy and to support new scenarios. I will conclude the talk by
arguing that DNNs not only perform better than GMMs but are also simpler.
Bibliographic reference. Yu, Dong (2012): "Who can understand your speech better - deep neural network or Gaussian mixture model?", In IWSLT-2012, 9.