We have incorporated spectrotemporal features in a speech activity detection (SAD) task for the Speech in Noisy Environments 2 (SPINE2) data set. The features were generated by applying 2D Gabor filters to the mel spectrogram in order to measure the strength of various spectral and temporal modulation frequencies in different patches of the spectrogram. Using several different back-ends, the Gabor features significantly outperformed MFCCs, yielding relative reductions in equal error rate (EER) of between 40 and 50%. Compared to the other backends, Adaboost with tree stumps performed particularly well with Gabor features and particularly poorly with MFCCs. An investigation into the reasons for this disparity suggests that the most useful features for SAD incorporate information over longer time scales.
Index Terms: spectrotemporal features, speech activity detection
Bibliographic reference. Tsai, T. J. / Morgan, Nelson (2012): "Longer features: they do a speech detector good", In INTERSPEECH-2012, 1356-1359.