A major goal of current speech recognition research is to improve the robustness of recognition systems used in noisy environments. Recent strides in computing technology have made it practical to use visual information to augment the decision capability of the recognizer, allowing superior performance in these difficult environments. A crucial question in audio-visual speech recognition is how to combine the separate modes of information. Late integration, an approach whereby separate audio-based and video-based decisions are made and then combined "late" in the process, has emerged as one of the simplest yet most effective techniques. Research has suggested that the appropriate fusion weighting for this technique (and for related methods such as multi-stream HMMs) depends on the level of interfering audio noise. This paper further characterizes the relationship between data fusion and audio noise, and demonstrates that optimal data fusion can be achieved only if both the noise level and the noise type are considered.
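As a point of reference (the standard formulation of late integration, not the specific weighting rule proposed in this paper), the audio and video streams are commonly fused as a weighted combination of their class-conditional log-likelihoods:

    log P(w | A, V) ∝ lambda * log P(A | w) + (1 - lambda) * log P(V | w),   0 <= lambda <= 1

where A and V are the acoustic and visual observations, w is the word (or state) hypothesis, and lambda is the audio stream weight. In this notation, the paper's thesis is that an optimal choice of lambda must account for both the level and the type of the interfering audio noise.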
Cite as: Patterson, E.K., Gurbuz, S., Tufekci, Z., Gowdy, J.N. (2001) Noise-based audio-visual fusion for robust speech recognition. Proc. Auditory-Visual Speech Processing, 195-198
@inproceedings{patterson01_avsp,
  author={E. K. Patterson and S. Gurbuz and Z. Tufekci and J. N. Gowdy},
  title={{Noise-based audio-visual fusion for robust speech recognition}},
  year=2001,
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={195--198}
}