8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Audio-Visual Integration for Robust Speech Recognition Using Maximum Weighted Stream Posteriors

Rowan Seymour, Darryl Stewart, Ji Ming

Queen's University Belfast, UK

In this paper, we demonstrate, for the first time, the robustness of the Maximum Stream Posterior (MSP) method for audio-visual integration on a large speaker-independent speech recognition task in noisy conditions. Furthermore, we show that the method can be generalised and improved by using a softer weighting scheme to account for moderate noise conditions. We call this generalised method the Maximum Weighted Stream Posterior (MWSP) method. In addition, we carry out the first tests of the Posterior Union Model approach for audio-visual integration. All of the methods are compared in digit recognition tests involving various audio and video noise levels and conditions, including tests where both modalities are affected by noise. We also introduce a novel form of noise called jitter, which is used to simulate camera movement. The results verify that the MSP approach is robust and that its generalised form (MWSP) can lead to further improvements in moderate noise conditions.

Bibliographic reference: Seymour, Rowan / Stewart, Darryl / Ming, Ji (2007): "Audio-visual integration for robust speech recognition using maximum weighted stream posteriors", In INTERSPEECH-2007, 654-657.