Real Time Online Visual End Point Detection Using Unidirectional LSTM

Tanay Sharma, Rohith Chandrashekar Aralikatti, Dilip Kumar Margam, Abhinav Thanda, Sharad Roy, Pujitha Appan Kandala, Shankar M. Venkatesan

Visual Voice Activity Detection (V-VAD) involves detecting the speech activity of a speaker from visual features. V-VAD is useful for detecting the end point of an utterance under noisy acoustic conditions or for maintaining speaker privacy. In this paper, we propose a speaker-independent, real-time solution for V-VAD. The focus is on real-time operation and accuracy, since such algorithms play a key role in end-point detection, especially when interacting with speech assistants. We propose two novel methods: one using CNN features and the other using 2D-DCT features. Unidirectional LSTMs are used in both methods to enable online operation and to learn temporal dependence. The methods are tested on two publicly available datasets and, in addition, on a locally collected dataset, which further validates our hypothesis. Experiments show that both approaches generalize to unseen speakers, and our best approach gives a substantial improvement over earlier methods evaluated on the same dataset.
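As a rough illustration of the 2D-DCT front end described above, the sketch below computes the 2D-DCT of a mouth region-of-interest (ROI) frame and keeps the low-frequency coefficients, in diagonal (zig-zag-like) order, as a per-frame feature vector of the kind that would be fed to a unidirectional LSTM. This is a minimal sketch, not the authors' implementation: the function names, the ROI size, the number of coefficients kept, and the exact coefficient ordering are all assumptions for illustration.

```python
import numpy as np

def dct2_matrix(n):
    # Orthonormal DCT-II basis matrix (n x n).
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] *= 1.0 / np.sqrt(2.0)
    return c * np.sqrt(2.0 / n)

def dct2(block):
    # Separable 2D DCT-II of a square image block: C @ X @ C^T.
    c = dct2_matrix(block.shape[0])
    return c @ block @ c.T

def mouth_roi_features(roi, num_coeffs=64):
    # Per-frame feature vector: the num_coeffs lowest-frequency 2D-DCT
    # coefficients of a (hypothetical) grayscale mouth ROI, taken in
    # a diagonal order so low frequencies come first.
    d = dct2(roi.astype(np.float64))
    n = d.shape[0]
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda p: (p[0] + p[1],
                                  p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([d[i, j] for i, j in order[:num_coeffs]])

# Example: one 32x32 synthetic "mouth ROI" frame; a video clip would
# yield a sequence of such vectors, consumed frame-by-frame by an LSTM.
roi = np.random.rand(32, 32)
feat = mouth_roi_features(roi)
```

Because the DCT is computed per frame and the LSTM is unidirectional, each new frame can be processed as it arrives, which is what makes the end-point detector usable online rather than requiring the full utterance.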

DOI: 10.21437/Interspeech.2019-3253

Cite as: Sharma, T., Aralikatti, R.C., Margam, D.K., Thanda, A., Roy, S., Kandala, P.A., Venkatesan, S.M. (2019) Real Time Online Visual End Point Detection Using Unidirectional LSTM. Proc. Interspeech 2019, 2000-2004, DOI: 10.21437/Interspeech.2019-3253.

@inproceedings{sharma19_interspeech,
  author={Tanay Sharma and Rohith Chandrashekar Aralikatti and Dilip Kumar Margam and Abhinav Thanda and Sharad Roy and Pujitha Appan Kandala and Shankar M. Venkatesan},
  title={{Real Time Online Visual End Point Detection Using Unidirectional LSTM}},
  booktitle={Proc. Interspeech 2019},
  year={2019},
  pages={2000--2004},
  doi={10.21437/Interspeech.2019-3253}
}