Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
Voice activity detection (VAD) is a critical factor in the degradation of speech recognition performance in noisy environments. A real-time VAD was developed that uses face parameters (eye and lip contours) as a front-end to the traditional GMM-based audio method, which models speech and noise. For shopping mall background noise, speech recognition performance with the audio-visual VAD is shown to be comparable to that with audio-only VAD. Advantages and limitations of introducing visual information are discussed.
Index Terms: voice activity detection, audio-visual, speech recognition, noisy environment, real-time.
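Illustrative sketch (not from the paper): the abstract describes a visual front-end placed before a GMM-based speech/noise audio classifier. One way such a cascade could be arranged is shown below, assuming diagonal-covariance GMMs, a per-frame log-likelihood ratio test, and a boolean lip-activity flag already derived from the face parameters. The function names, the threshold, and the gating rule are all hypothetical; the abstract does not specify how the two modalities are actually combined.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of one diagonal Gaussian component.
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(frame, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def audio_visual_vad(frames, lip_active, speech_gmm, noise_gmm, threshold=0.0):
    """Per-frame speech decisions: visual gate first, then audio GMM test.

    Assumption: a frame is rejected outright when no lip activity is seen;
    otherwise it is accepted if the speech/noise log-likelihood ratio
    exceeds `threshold`. Each GMM is a (weights, means, variances) tuple.
    """
    decisions = []
    for frame, lips in zip(frames, lip_active):
        if not lips:
            decisions.append(False)  # visual front-end rejects the frame
            continue
        llr = (gmm_log_likelihood(frame, *speech_gmm)
               - gmm_log_likelihood(frame, *noise_gmm))
        decisions.append(llr > threshold)
    return decisions
```

In this arrangement the visual gate can only remove frames that the audio test would otherwise accept, so it may suppress false alarms caused by background noise but cannot recover speech frames that the audio models miss.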
Bibliographic reference. Ishi, Carlos T. / Sato, Miki / Hagita, Norihiro / Lao, Shihong (2010): "Real-time audio-visual voice activity detection for speech recognition in noisy environments", In AVSP-2010, paper P5.