Automatic Speech Recognition (ASR), which plays an important role in human-robot interaction, should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas for improving robustness in such environments. This paper proposes two-layered AV integration for an ASR system, applying AV integration to both the Voice Activity Detection (VAD) and ASR decoding processes. We implemented a prototype ASR system based on the proposed two-layered AV integration and evaluated it in dynamically changing situations where audio and/or visual information can be noisy or missing. Preliminary results showed that the proposed method improves the robustness of the ASR system even in situations where the audio or visual input is contaminated.
Bibliographic reference. Yoshida, Takami / Nakadai, Kazuhiro (2010): "Two-layered audio-visual integration in voice activity detection and automatic speech recognition for robots", In INTERSPEECH-2010, 2702-2705.