This paper examines the problem of estimating stream weights for a multistream audio-visual speech recogniser in the context of a simultaneous speaker task. The task is challenging because the signal-to-noise ratio (SNR) cannot be readily inferred from the acoustics alone. The proposed method employs artificial neural networks (ANNs) to estimate the SNR from HMM state-likelihoods; the SNR estimate is then converted to a stream weight using a mapping optimised on development data. The method produces audio-visual recognition performance better than both the audio-only and video-only baselines across a wide range of SNRs. Performance using SNR estimates based on audio state-likelihoods alone is compared with that obtained using both audio and visual likelihoods. Although the audio-visual SNR estimator outperforms the audio-only SNR estimator, the resulting recognition benefit is small. Ideas for making fuller use of the visual information are discussed.
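The abstract does not spell out the combination rule, but a standard multistream formulation raises the per-state audio and visual likelihoods to exponents λ and (1 − λ), i.e. a weighted sum of log-likelihoods. The sketch below is a minimal Python illustration of that idea; the sigmoid `snr_to_weight` mapping is a hypothetical stand-in for the paper's mapping, which is optimised on development data.

```python
import numpy as np

def snr_to_weight(snr_db, midpoint=0.0, slope=0.2):
    """Map an estimated SNR (dB) to an audio stream weight in [0, 1].
    A sigmoid shape is an assumption; the paper instead optimises its
    SNR-to-weight mapping on development data."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint)))

def combined_log_likelihood(log_lik_audio, log_lik_video, snr_db):
    """Combine per-state audio and visual log-likelihoods with an
    SNR-derived stream weight: lam * logP_a + (1 - lam) * logP_v."""
    lam = snr_to_weight(snr_db)
    return (lam * np.asarray(log_lik_audio)
            + (1.0 - lam) * np.asarray(log_lik_video))

# Illustrative values: at low estimated SNR, lam is small and the
# visual stream dominates the combined score.
audio_ll = np.array([-12.0, -15.5, -9.8])   # per-state audio log-likelihoods
video_ll = np.array([-10.2, -11.0, -13.1])  # per-state visual log-likelihoods
print(combined_log_likelihood(audio_ll, video_ll, snr_db=-5.0))
```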
Cite as: Shao, X., Barker, J. (2006) Audio-visual speech recognition in the presence of a competing speaker. Proc. Interspeech 2006, paper 1589-Tue3WeO.6, doi: 10.21437/Interspeech.2006-380
@inproceedings{shao06_interspeech,
  author={Xu Shao and Jon Barker},
  title={{Audio-visual speech recognition in the presence of a competing speaker}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1589-Tue3WeO.6},
  doi={10.21437/Interspeech.2006-380}
}