Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Audio-Visual Speech Recognition Using MCE-Based HMMs and Model-Dependent Stream Weights

Chiyomi Miyajima, Keiichi Tokuda, Tadashi Kitamura

Department of Computer Science, Nagoya Institute of Technology, Japan

This paper presents a framework for designing a hidden Markov model (HMM)-based audio-visual automatic speech recognition (ASR) system based on minimum classification error (MCE) training. Audio and visual HMM parameters are optimized with the generalized probabilistic descent (GPD) method, and their likelihoods are combined using model-dependent stream weights that are also estimated with the GPD method. Experimental results on speaker-independent isolated word recognition show that audio-visual ASR performance is significantly improved by the GPD optimization of the audio and visual HMMs and by the introduction of model-dependent stream weights, yielding 47% to 81% error reduction over a conventional system consisting of HMMs trained under the maximum likelihood criterion with globally tied stream weights estimated by the GPD method.
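For reference, the likelihood combination in multi-stream audio-visual HMMs is conventionally written as follows; the notation here is a generic sketch and may differ from the symbols used in the full paper:

    log b_mj(o_t) = w_m^A log b_mj^A(o_t^A) + w_m^V log b_mj^V(o_t^V),   with w_m^A + w_m^V = 1 and w_m^A, w_m^V >= 0,

where o_t^A and o_t^V are the audio and visual observation vectors at time t, b_mj^A and b_mj^V are the audio and visual output densities of state j of model m, and (w_m^A, w_m^V) is the stream-weight pair of model m. Making the weights model-dependent assigns one such pair to each word HMM instead of a single globally tied pair; in MCE/GPD training, both the HMM parameters and these weights are updated by gradient steps on a smoothed classification-error loss.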



Bibliographic reference. Miyajima, Chiyomi / Tokuda, Keiichi / Kitamura, Tadashi (2000): "Audio-visual speech recognition using MCE-based HMMs and model-dependent stream weights", in ICSLP-2000, vol. 2, 1023-1026.