11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Two-Layered Audio-Visual Integration in Voice Activity Detection and Automatic Speech Recognition for Robots

Takami Yoshida (1), Kazuhiro Nakadai (2)

(1) Tokyo Institute of Technology, Japan
(2) Honda Research Institute Japan Co. Ltd., Japan

Automatic Speech Recognition (ASR), which plays an important role in human-robot interaction, should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas for improving robustness in such environments. This paper proposes two-layered AV integration for an ASR system, applying AV integration to both the Voice Activity Detection (VAD) and ASR decoding processes. We implemented a prototype ASR system based on the proposed two-layered AV integration and evaluated it in dynamically changing situations where audio and/or visual information can be noisy or missing. Preliminary results showed that the proposed method improves the robustness of the ASR system even in acoustically or visually contaminated situations.
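
The two layers named in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the streams are combined, so everything here is an assumption made for illustration: a weighted sum of per-frame speech scores for the VAD layer, a weighted sum of per-word log-likelihoods for the decoding layer, `None` as the marker for a missing stream, and the hypothetical names `fuse_vad`, `detect_speech`, `decode`, and `audio_weight`. It should not be read as the authors' implementation.

```python
# Illustrative sketch only: the fusion rules and all names are assumptions,
# not the method described in the paper.

def fuse_vad(audio_score, visual_score, audio_weight=0.5):
    """Layer 1 (VAD): combine per-frame speech scores in [0, 1].

    A score of None marks a missing stream; the other stream is then used alone.
    """
    if audio_score is None and visual_score is None:
        return 0.0  # no evidence at all: treat the frame as non-speech
    if audio_score is None:
        return visual_score
    if visual_score is None:
        return audio_score
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def detect_speech(audio_scores, visual_scores, threshold=0.5, audio_weight=0.5):
    """Return one boolean per frame: True where the fused score reaches the threshold."""
    return [
        fuse_vad(a, v, audio_weight) >= threshold
        for a, v in zip(audio_scores, visual_scores)
    ]


def decode(audio_ll, visual_ll, audio_weight=0.5):
    """Layer 2 (decoding): pick the word with the best fused log-likelihood.

    Each argument maps candidate words to log-likelihoods, or is None if the
    stream is missing. Both dicts are assumed to share the same keys.
    """
    if audio_ll is None and visual_ll is None:
        return None
    if audio_ll is None:
        scores = visual_ll
    elif visual_ll is None:
        scores = audio_ll
    else:
        scores = {
            word: audio_weight * audio_ll[word]
            + (1.0 - audio_weight) * visual_ll[word]
            for word in audio_ll
        }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Third frame has no audio, so the visual score alone decides.
    print(detect_speech([0.9, 0.2, None], [0.8, 0.1, 0.7]))
    # Trusting audio more favors the word the audio stream prefers.
    print(decode({"yes": -1.0, "no": -3.0}, {"yes": -2.5, "no": -1.5}, audio_weight=0.7))
```

In this sketch, lowering `audio_weight` shifts trust toward the visual stream, which is one simple way to cope with acoustically noisy conditions; how the weights would be chosen in a dynamically changing environment is left open here.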

Bibliographic reference. Yoshida, Takami / Nakadai, Kazuhiro (2010): "Two-layered audio-visual integration in voice activity detection and automatic speech recognition for robots", in INTERSPEECH-2010, 2702-2705.