ISCA Archive AVSP 2003

Improving audio-visual speech recognition with an infrared headset

Jing Huang, Gerasimos Potamianos, Chalapathy Neti

Visual speech is known to improve the accuracy and noise robustness of automatic speech recognizers. However, almost all audio-visual ASR systems require tracking frontal facial features for visual information extraction, a computationally intensive and error-prone process. In this paper, we consider a specially designed infrared headset that captures audio-visual data focused consistently on the speaker's mouth region, thus eliminating the need for face tracking. We conduct small-vocabulary recognition experiments on such data, benchmarking ASR performance against traditional frontal, full-face videos collected both in an ideal studio-like environment and in a more challenging office domain. Using the infrared headset, we report a dramatic improvement in visual-only ASR, amounting to relative word error rate reductions of 30% and 54% compared to the studio and office data, respectively. Furthermore, when combining the visual modality with the acoustic signal, the resulting relative ASR gain over audio-only performance is significantly higher for the infrared headset data.

Cite as: Huang, J., Potamianos, G., Neti, C. (2003) Improving audio-visual speech recognition with an infrared headset. Proc. Auditory-Visual Speech Processing, 175-178.

@inproceedings{huang03_avsp,
  author={Jing Huang and Gerasimos Potamianos and Chalapathy Neti},
  title={{Improving audio-visual speech recognition with an infrared headset}},
  booktitle={Proc. Auditory-Visual Speech Processing},
  year={2003},
  pages={175--178}
}