ISCA Archive Interspeech 2006

Audio-visual speech recognition in the presence of a competing speaker

Xu Shao, Jon Barker

This paper examines the problem of estimating stream weights for a multistream audio-visual speech recogniser in the context of a simultaneous speaker task. The task is challenging because the signal-to-noise ratio (SNR) cannot be readily inferred from the acoustics alone. The proposed method employs artificial neural networks (ANNs) to estimate the SNR from HMM state-likelihoods. The SNR is then converted to a stream weight using a mapping optimised on development data. The method yields audio-visual recognition performance better than that of both the audio-only and the video-only baselines across a wide range of SNRs. The performance using SNR estimates based on audio state-likelihoods alone is compared to that obtained using both audio and visual likelihoods. Although the audio-visual SNR estimator outperforms the audio-only SNR estimator, the benefit to recognition performance is small. Ideas for making fuller use of the visual information are discussed.
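
As a rough illustration of the weighting idea described above, the sketch below combines per-state audio and video log-likelihoods using a single audio stream weight derived from an SNR estimate. It is not the authors' implementation: the function names, the logistic shape of the SNR-to-weight mapping, and its `midpoint_db` and `slope` parameters are assumptions made purely for illustration. The abstract says only that the mapping is optimised on development data, and the ANN-based SNR estimator is not shown here.

```python
import math


def snr_to_audio_weight(snr_db, midpoint_db=0.0, slope=0.25):
    """Map an SNR estimate (dB) to an audio stream weight in (0, 1).

    ASSUMPTION: a logistic curve is used only as a placeholder; the
    actual mapping in the paper is optimised on development data.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def combine_log_likelihoods(audio_ll, video_ll, audio_weight):
    """Weighted sum of per-state log-likelihoods from the two streams.

    The video stream receives the complementary weight (1 - audio_weight),
    a common convention in multistream recognisers (assumed here).
    """
    if len(audio_ll) != len(video_ll):
        raise ValueError("streams must score the same set of states")
    video_weight = 1.0 - audio_weight
    return [audio_weight * a + video_weight * v
            for a, v in zip(audio_ll, video_ll)]


if __name__ == "__main__":
    # Hypothetical log-likelihoods for three HMM states in one frame.
    audio = [-12.0, -8.5, -15.2]
    video = [-9.0, -11.0, -10.5]
    for snr in (-10.0, 0.0, 20.0):
        w = snr_to_audio_weight(snr)
        print(snr, round(w, 3), combine_log_likelihoods(audio, video, w))
```

With a mapping of this kind, a low SNR estimate shifts the decision towards the video stream and a high SNR estimate shifts it towards the audio stream.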


doi: 10.21437/Interspeech.2006-380

Cite as: Shao, X., Barker, J. (2006) Audio-visual speech recognition in the presence of a competing speaker. Proc. Interspeech 2006, paper 1589-Tue3WeO.6, doi: 10.21437/Interspeech.2006-380

@inproceedings{shao06_interspeech,
  author={Xu Shao and Jon Barker},
  title={{Audio-visual speech recognition in the presence of a competing speaker}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1589-Tue3WeO.6},
  doi={10.21437/Interspeech.2006-380}
}