Voice activity detection (VAD) is an important preprocessing step in speech-based systems, especially for emerging hands-free intelligent assistants. Conventional VAD systems relying on audio-only features are often degraded by environmental noise. Audiovisual VAD (AV-VAD) systems are an alternative approach to address this problem. Modeling timing dependencies between acoustic and visual features is a challenge in AV-VAD. This study proposes a bimodal recurrent neural network (RNN) that combines audiovisual features in a principled, unified framework, capturing timing dependencies within and across modalities. Each modality is modeled with a separate bidirectional long short-term memory (BLSTM) network. The output layers of these networks are then used as input to a second BLSTM network. The experimental evaluation considers a large audiovisual corpus with clean and noisy recordings to assess the robustness of the approach. The proposed approach outperforms an audio-only VAD system by 7.9% (absolute) under clean/ideal conditions (i.e., high definition (HD) camera, close-talk microphone), and by 18.5% (absolute) under more challenging conditions (i.e., camera and microphone from a tablet with environmental noise). The proposed approach shows the best performance and robustness across a variety of conditions, demonstrating its potential for real-world applications.
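To make the two-stage architecture described above concrete, the following is a minimal sketch in PyTorch: one BLSTM per modality models within-modality dynamics, their per-frame outputs are combined and fed to a second BLSTM that models cross-modal dynamics, followed by a frame-level speech/non-speech decision. Feature dimensions, hidden sizes, and the concatenation-based fusion are illustrative assumptions, not the configuration used in the paper.

    # Sketch of a bimodal BLSTM for AV-VAD (illustrative; hyperparameters are assumed).
    import torch
    import torch.nn as nn

    class BimodalRNNVAD(nn.Module):
        def __init__(self, audio_dim=40, visual_dim=50, hidden_dim=64):
            super().__init__()
            # One BLSTM per modality captures timing dependencies within each modality.
            self.audio_blstm = nn.LSTM(audio_dim, hidden_dim,
                                       batch_first=True, bidirectional=True)
            self.visual_blstm = nn.LSTM(visual_dim, hidden_dim,
                                        batch_first=True, bidirectional=True)
            # A second-stage BLSTM over the combined per-frame outputs captures
            # timing dependencies across modalities.
            self.fusion_blstm = nn.LSTM(4 * hidden_dim, hidden_dim,
                                        batch_first=True, bidirectional=True)
            # Frame-level speech/non-speech decision.
            self.classifier = nn.Linear(2 * hidden_dim, 1)

        def forward(self, audio_feats, visual_feats):
            # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
            audio_out, _ = self.audio_blstm(audio_feats)
            visual_out, _ = self.visual_blstm(visual_feats)
            fused, _ = self.fusion_blstm(torch.cat([audio_out, visual_out], dim=-1))
            # Per-frame VAD posterior in [0, 1].
            return torch.sigmoid(self.classifier(fused)).squeeze(-1)

    # Example: 2 utterances, 100 synchronized audio/visual frames each.
    model = BimodalRNNVAD()
    vad_posteriors = model(torch.randn(2, 100, 40), torch.randn(2, 100, 50))
    print(vad_posteriors.shape)  # torch.Size([2, 100])

This sketch assumes the audio and visual streams are already synchronized at the frame level; any temporal alignment or feature extraction front end is outside its scope.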
Cite as: Tao, F., Busso, C. (2017) Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection. Proc. Interspeech 2017, 1938-1942, doi: 10.21437/Interspeech.2017-1573
@inproceedings{tao17_interspeech,
  author={Fei Tao and Carlos Busso},
  title={{Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1938--1942},
  doi={10.21437/Interspeech.2017-1573}
}