Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

Weighting and Normalisation of Synchronous HMMs for Audio-Visual Speech Recognition

David Dean (1), Patrick Lucey (1), Sridha Sridharan (1), Tim Wark (1,2)

(1) Speech, Audio, Image and Video Research Laboratory, Queensland University of Technology; (2) CSIRO ICT Centre; Brisbane, Australia

In this paper, we examine the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition. Rather than assuming the same stream weights for training and testing, we examine how different stream weights for each task affect the final speech-recognition performance. Evaluating our system under varying levels of audio and video degradation on the XM2VTS database, we show that the final performance is primarily a function of the stream weight used in testing, and that the stream weight used for training has only a very minor effect. By varying the testing stream weights, we show that the best average speech-recognition performance occurs with the streams weighted at around 80% audio and 20% video. However, by examining the distribution of frame-by-frame scores for each stream on a left-out section of the database, we show that the chosen testing weights primarily serve to normalise the two stream score distributions, rather than indicating the dependence of the final performance on either stream. By using a novel adaptation of zero-normalisation to normalise each stream's models before performing the weighted fusion, we show that the actual contribution of the audio and video scores to the best-performing speech system is closer to equal than appears to be indicated by the unnormalised stream weighting parameters alone.
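
The relationship between raw stream weights and normalised stream weights can be illustrated with a minimal sketch. It assumes the common formulation in which the fused per-frame score is a weighted sum of the audio and video log-likelihoods, and it uses plain zero-normalisation (subtract the mean and divide by the standard deviation estimated on held-out scores), not necessarily the adaptation used in the paper. All function names and all score values below are hypothetical and synthetic; none are taken from the paper's experiments.

```python
from statistics import mean, pstdev


def znorm_params(held_out_scores):
    """Estimate (mean, std) of per-frame scores on a held-out set."""
    return mean(held_out_scores), pstdev(held_out_scores)


def znorm(score, params):
    """Plain zero-normalisation of a single per-frame score."""
    mu, sigma = params
    return (score - mu) / sigma


def fuse(audio_score, video_score, audio_weight):
    """Weighted sum of two per-frame scores; the weights sum to one."""
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


def equivalent_raw_audio_weight(norm_audio_weight, audio_sigma, video_sigma):
    """Raw audio weight whose scaling of the unnormalised scores matches
    a given weight applied to zero-normalised scores.

    Dividing each stream by its standard deviation rescales its weight,
    so the two streams are compared after renormalising to sum to one.
    """
    a = norm_audio_weight / audio_sigma
    v = (1.0 - norm_audio_weight) / video_sigma
    return a / (a + v)


# Synthetic held-out scores (assumption: video scores spread 4x wider).
audio_params = znorm_params([-11, -9, -11, -9])    # mean -10, std 1
video_params = znorm_params([-44, -36, -44, -36])  # mean -40, std 4

# Equal weighting of the normalised streams...
raw_weight = equivalent_raw_audio_weight(0.5, audio_params[1], video_params[1])
# ...corresponds to a raw audio weight of about 0.8 for these values.
print(raw_weight)
```

With these synthetic spreads, an equal weighting of the normalised scores looks like a heavily audio-biased weighting when expressed on the raw scores, which is the kind of effect the abstract describes.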

Bibliographic reference. Dean, David / Lucey, Patrick / Sridharan, Sridha / Wark, Tim (2007): "Weighting and normalisation of synchronous HMMs for audio-visual speech recognition", in AVSP-2007, paper P28.