Reliable automatic detection of speech/non-speech activity in degraded, noisy audio signals is a fundamental and challenging task in robust signal processing. Since many speech technology applications rely on the accuracy of a Voice Activity Detection (VAD) system for their effectiveness and robustness, the problem has gained considerable research interest over the years. It has been shown that in highly distorted conditions, an accurate segmentation of the target speech can be achieved by combining multiple feature streams. In this paper, we extract four one-dimensional streams, each attempting to separate speech from the disturbing background by exploiting a different speech-related characteristic, i.e. (i) the spectral shape, (ii) spectro-temporal modulations, (iii) the periodicity structure due to the presence of pitch harmonics, and (iv) the long-term spectral variability profile. The information from these streams is then expanded over long-duration context windows and applied to the input layer of a standard Multilayer Perceptron classifier. The proposed VAD was evaluated on the DARPA RATS corpora and is shown to be very competitive with current state-of-the-art systems.
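The context-window expansion described above can be sketched as follows: per-frame feature values from the four streams are concatenated with their neighboring frames before being fed to the classifier's input layer. This is a minimal illustration, not the paper's implementation; the window size (15 frames on each side), the edge-padding strategy, and the random input are assumptions for demonstration only.

```python
import numpy as np

def stack_context(frames, left=15, right=15):
    """Expand per-frame features over a context window.

    frames: (T, D) array of per-frame features (here D = 4, one value per
    stream: spectral shape, spectro-temporal modulation, periodicity, and
    long-term spectral variability; the window size and padding mode are
    illustrative assumptions, not the paper's settings).
    Returns a (T, D * (left + right + 1)) array; edge frames are padded
    by repeating the first/last frame.
    """
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    # For each frame t, concatenate frames t-left .. t+right.
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])

# Example: 100 frames, 4 one-dimensional streams.
X = np.random.randn(100, 4)
X_ctx = stack_context(X, left=15, right=15)
# X_ctx has shape (100, 124) and would form the MLP's input layer.
```

With a 31-frame window, each 4-dimensional frame expands to a 124-dimensional input vector, which is what allows the classifier to exploit long-term temporal context around each frame.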
Bibliographic reference. Van Segbroeck, Maarten / Tsiartas, Andreas / Narayanan, Shrikanth (2013): "A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice", in INTERSPEECH-2013, 704-708.