ISCA Archive Interspeech 2017

Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection

Fei Tao, Carlos Busso

Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems relying on audio-only features are normally impaired by noise in the environment. An alternative approach to address this problem is audiovisual VAD (AV-VAD) systems. Modeling timing dependencies between acoustic and visual features is a challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) which combines audiovisual features in a principled, unified framework, capturing the timing dependency within modalities and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network. The outputs of these networks are used as input to another BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms audio-only VAD by 7.9% (absolute) under clean/ideal conditions (i.e., high definition (HD) camera, close-talk microphone). The proposed solution outperforms the audio-only VAD system by 18.5% (absolute) when the conditions are more challenging (i.e., camera and microphone from a tablet with noise in the environment). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
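
The data flow described in the abstract (one BLSTM per modality, with their outputs feeding a further BLSTM) can be illustrated with a minimal, untrained sketch in plain Python. This is not the authors' implementation. The class names, the hidden size, the random weights, the frame-level concatenation of the two modality outputs, and the final sigmoid speech/non-speech score are all assumptions made for illustration, and the audio and visual streams are assumed to be already aligned to the same number of frames.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Single LSTM cell with random, untrained weights (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        n = input_size + hidden_size
        # One weight matrix and bias vector per gate:
        # i = input, f = forget, o = output, c = candidate cell state.
        self.W = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n)]
                      for _ in range(hidden_size)] for g in "ifoc"}
        self.b = {g: [0.0] * hidden_size for g in "ifoc"}

    def _gate(self, g, z, act):
        return [act(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(self.W[g], self.b[g])]

    def step(self, x, h, c):
        z = x + h  # concatenate current input and previous hidden state
        i = self._gate("i", z, sigmoid)
        f = self._gate("f", z, sigmoid)
        o = self._gate("o", z, sigmoid)
        g = self._gate("c", z, math.tanh)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


class BLSTM:
    """Bidirectional LSTM: forward and backward passes, concatenated per frame."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        self.fwd = LSTMCell(input_size, hidden_size, rng)
        self.bwd = LSTMCell(input_size, hidden_size, rng)

    def _run(self, cell, seq):
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in seq:
            h, c = cell.step(x, h, c)
            outputs.append(h)
        return outputs

    def __call__(self, seq):
        forward = self._run(self.fwd, seq)
        # Process the reversed sequence, then restore the original frame order.
        backward = self._run(self.bwd, seq[::-1])[::-1]
        return [hf + hb for hf, hb in zip(forward, backward)]


class BimodalRNN:
    """Hypothetical bimodal network: two modality BLSTMs feeding a fusion BLSTM."""

    def __init__(self, audio_dim, video_dim, hidden=4, seed=0):
        rng = random.Random(seed)
        self.audio_net = BLSTM(audio_dim, hidden, rng)
        self.video_net = BLSTM(video_dim, hidden, rng)
        # Each modality BLSTM emits 2 * hidden values per frame (assumed concatenation).
        self.fusion_net = BLSTM(4 * hidden, hidden, rng)
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden)]

    def __call__(self, audio_seq, video_seq):
        a = self.audio_net(audio_seq)
        v = self.video_net(video_seq)
        fused = self.fusion_net([af + vf for af, vf in zip(a, v)])
        # Assumed output layer: one speech score in (0, 1) per frame.
        return [sigmoid(sum(w * h for w, h in zip(self.w_out, frame)))
                for frame in fused]


# Toy usage: 5 aligned frames, 3 acoustic and 2 visual features per frame.
model = BimodalRNN(audio_dim=3, video_dim=2)
audio = [[0.1, 0.2, 0.3] for _ in range(5)]
video = [[0.5, 0.4] for _ in range(5)]
scores = model(audio, video)
```

Because the fusion BLSTM sees both modality outputs at every frame and runs in both directions, it can in principle relate an acoustic event to visual activity that occurs earlier or later, which is the cross-modal timing dependency the abstract refers to. With untrained weights the scores carry no meaning; the sketch only shows how the three networks are wired together.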

doi: 10.21437/Interspeech.2017-1573

Cite as: Tao, F., Busso, C. (2017) Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection. Proc. Interspeech 2017, 1938-1942, doi: 10.21437/Interspeech.2017-1573

  author={Fei Tao and Carlos Busso},
  title={{Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection}},
  booktitle={Proc. Interspeech 2017},