Odyssey 2012 - The Speaker and Language Recognition Workshop

Singapore
June 25-28, 2012

Being Deep and Being Dynamic - New-Generation Models and Methodology for Advancing Speech Technology

Li Deng

Microsoft Research, Redmond, WA, USA

Semantic information embedded in the speech signal - not only the phonetic/linguistic content but also a full range of paralinguistic information, including speaker characteristics - manifests itself in a dynamic process rooted in the deep linguistic hierarchy as an intrinsic part of the human cognitive system. Modeling both the dynamic process and the deep structure to advance speech technology has been an active pursuit for more than 20 years, but only in the past few years has a noticeable breakthrough been achieved, through the new methodology commonly referred to as "deep learning". The Deep Belief Net (DBN) has recently been used to replace the Gaussian Mixture Model (GMM) component in HMM-based speech recognition while keeping the HMM component intact, and has produced dramatic error-rate reductions in both phone recognition and large-vocabulary speech recognition. Separately, the (constrained) Dynamic Bayesian Net (referred to as DBN* here) has been developed over many years to improve dynamic models of speech and to overcome the IID assumption, a key weakness of the HMM, using a set of techniques and representations commonly known as hidden dynamic/trajectory models or articulatory-like models. The history of these two largely separate lines of "DBN/DBN*" research will be critically reviewed and analyzed in the context of modeling the deep and dynamic linguistic hierarchy for advancing speech (as well as speaker) recognition technology. Future directions will be discussed for this exciting area of research, which holds promise for building a foundation for next-generation speech technology with human-like cognitive ability.

Bibliographic reference. Deng, Li (2012): "Being deep and being dynamic - new-generation models and methodology for advancing speech technology", in Odyssey-2012 (abstract).