We describe the development of a multistream HMM-based audio-visual speech recognition (AVSR) system and a new method for integrating the audio and visual streams using frame-level posterior probabilities. This method is compared with the standard feature concatenation and weighted product methods in speaker-dependent tests on our own multimodal database, by examining recognition robustness to corruption in either stream. For corruption in the audio stream we use additive noise at different SNR levels, and for corruption in the video stream we use MPEG-4 compression at different bitrates as well as image blurring with Gaussian filters. We present very promising results that demonstrate the robustness of the new method.
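The two integration strategies mentioned above can be illustrated with a minimal sketch. The paper's exact formulation is not given here; the code below assumes each stream produces per-frame, per-class log-likelihoods, combines them with the standard weighted product rule (in the log domain, a weighted sum), and converts one frame's scores into posterior probabilities via a softmax. The function names and the weight parameter `lam` are illustrative, not from the paper.

```python
import numpy as np

def weighted_product_combination(log_lik_audio, log_lik_video, lam=0.5):
    """Standard weighted product rule for multistream HMMs:
    P(o|c) ∝ P_a(o|c)^lam * P_v(o|c)^(1-lam), which in the log
    domain is a weighted sum of the per-stream log-likelihoods.
    lam in [0, 1] controls the relative trust in the audio stream."""
    return lam * log_lik_audio + (1.0 - lam) * log_lik_video

def frame_posteriors(log_liks, log_priors=None):
    """Turn one frame's per-class log-likelihoods into posterior
    probabilities P(c|o) with a numerically stable softmax.
    Frame-level posteriors like these are the quantities the new
    integration method operates on."""
    scores = log_liks if log_priors is None else log_liks + log_priors
    shifted = scores - scores.max()          # avoid overflow in exp
    p = np.exp(shifted)
    return p / p.sum()

# Example: three classes, one frame from each stream.
la = np.array([-1.0, -2.0, -3.0])            # audio log-likelihoods
lv = np.array([-2.0, -1.0, -4.0])            # video log-likelihoods
combined = weighted_product_combination(la, lv, lam=0.5)
posteriors = frame_posteriors(combined)
```

With `lam = 1.0` the combination reduces to audio-only recognition, and with `lam = 0.0` to video-only, which is why fixed-weight product rules degrade when one stream is corrupted at an unknown level.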
Cite as: Seymour, R., Ming, J., Stewart, D. (2005) A new posterior based audio-visual integration method for robust speech recognition. Proc. Interspeech 2005, 1229-1232, doi: 10.21437/Interspeech.2005-375
@inproceedings{seymour05_interspeech,
  author={Rowan Seymour and Ji Ming and Darryl Stewart},
  title={{A new posterior based audio-visual integration method for robust speech recognition}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={1229--1232},
  doi={10.21437/Interspeech.2005-375}
}