AVSP 2003 - International Conference on Audio-Visual Speech Processing
September 4-7, 2003
Visual speech is known to improve accuracy and noise robustness of automatic speech recognizers. However, almost all audio-visual ASR systems require tracking frontal facial features for visual information extraction, a computationally intensive and error-prone process. In this paper, we consider a specially designed infrared headset to capture audio-visual data, that consistently focuses on the speakerís mouth region, thus eliminating the need for face tracking. We conduct small-vocabulary recognition experiments on such data, benchmarking their ASR performance against traditional frontal, fullface videos, collected both at an ideal studio-like environment and at a more challenging office domain. By using the infrared headset, we report a dramatic improvement in visual-only ASR that amounts to a relative 30% and 54% word error rate reduction, compared to the studio and office data, respectively. Furthermore, when combining the visual modality with the acoustic signal, the resulting relative ASR gain with respect to audio-only performance is significantly higher for the infrared headset data.
Bibliographic reference. Huang, Jing / Potamianos, Gerasimos / Neti, Chalapathy (2003): "Improving audio-visual speech recognition with an infrared headset", In AVSP 2003, 175-178.