10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Combined Discriminative Training for Multi-Stream HMM-Based Audio-Visual Speech Recognition

Jing Huang (1), Karthik Visweswariah (2)

(1) IBM T.J. Watson Research Center, USA
(2) IBM India Research Lab, India

In this paper we investigate discriminative training of models and feature space for a multi-stream hidden Markov model (HMM) based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work, which considered discriminative training of each stream's parameters independently of the other. In experiments on a 20-speaker, one-hour, speaker-independent test set, we obtain a 22% relative gain in AVSR performance over A/V models whose parameters are trained separately, and a 50% relative gain over the baseline maximum-likelihood models. On a noisy test set (mismatched to training), we obtain a 21% relative gain over A/V models whose parameters are trained separately, which represents a 30% relative improvement over the maximum-likelihood baseline.
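In a multi-stream HMM of the kind described above, decoding combines the audio and visual streams by weighting their per-state log-likelihoods with stream exponents. A minimal sketch of this standard combination rule follows; the function and variable names, the weight value, and the example scores are illustrative assumptions, not taken from the paper:

```python
def combined_stream_score(audio_ll, visual_ll, audio_weight):
    """Combine per-stream log-likelihoods for one HMM state.

    audio_ll, visual_ll: log-likelihoods of the current observation
    under the audio and visual stream models for this state.
    audio_weight: stream exponent in [0, 1]; the visual stream
    implicitly receives (1 - audio_weight).
    """
    return audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll

# Hypothetical example: pick the best of three candidate states.
audio_scores = [-4.1, -2.3, -6.0]    # audio log-likelihoods (made up)
visual_scores = [-3.0, -5.5, -1.2]   # visual log-likelihoods (made up)
lam = 0.7                            # audio often weighted higher in clean conditions

combined = [combined_stream_score(a, v, lam)
            for a, v in zip(audio_scores, visual_scores)]
best_state = max(range(len(combined)), key=combined.__getitem__)
```

Because both streams enter every combined score, the paper's point is that optimizing each stream's parameters in isolation ignores this coupling; joint discriminative training optimizes them against the combined decoding score instead.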


Bibliographic reference.  Huang, Jing / Visweswariah, Karthik (2009): "Combined discriminative training for multi-stream HMM-based audio-visual speech recognition", In INTERSPEECH-2009, 1379-1382.