Speech activity detection (SAD) on channel transmissions is a critical preprocessing task for speech, speaker, and language recognition, and for further human analysis. This paper presents a feature combination approach to improve SAD on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency's (DARPA) Robust Automatic Transcription of Speech (RATS) program. The key contribution is an exploration of feature combinations that bring together several novel SAD features, based on pitch and spectro-temporal processing, and the standard Mel Frequency Cepstral Coefficients (MFCC) acoustic feature. The SAD features are: (1) a GABOR feature representation followed by a multilayer perceptron (MLP); (2) a feature that combines multiple voicing features and spectral flux measures (Combo); (3) a feature based on subband autocorrelation (SAcC) with MLP postprocessing; and (4) a multiband comb-filter F0 (MBCombF0) voicing measure. We present results for single features, pairwise combinations, and the combination of all features. Pairwise feature-level combination yields large error reductions over the MFCC baseline, and the best performance is achieved by combining all features.
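As an illustration of what feature-level combination can look like in practice, the sketch below concatenates per-frame feature matrices along the feature axis before they would be passed to a classifier. It is a minimal example under stated assumptions, not the authors' implementation: the function name combine_features, the feature dimensionalities, the random placeholder data, and the truncation to the shortest stream are all hypothetical choices made for illustration.

```python
import numpy as np

def combine_features(streams):
    """Concatenate per-frame feature matrices (frames x dims) along the feature axis.

    Assumption: all streams share the same frame rate; any length mismatch
    is handled by truncating to the shortest stream.
    """
    n_frames = min(s.shape[0] for s in streams)
    return np.hstack([s[:n_frames] for s in streams])

# Placeholder data standing in for real features (dimensions are hypothetical).
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))    # stand-in for MFCC frames
voicing = rng.normal(size=(100, 1))  # stand-in for a one-dimensional voicing measure

combined = combine_features([mfcc, voicing])
print(combined.shape)
```

Each row of the combined matrix holds one frame's MFCC values followed by its voicing value, so a single frame-level model can be trained on both feature types at once.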
Bibliographic reference. Graciarena, Martin / Alwan, Abeer / Ellis, Dan / Franco, Horacio / Ferrer, Luciana / Hansen, John H. L. / Janin, Adam / Lee, Byung-Suk / Lei, Yun / Mitra, Vikramjit / Morgan, Nelson / Sadjadi, Seyed Omid / Tsai, T. J. / Scheffer, Nicolas / Tan, Lee Ngee / Williams, Benjamin (2013): "All for one: feature combination for highly channel-degraded speech activity detection", In INTERSPEECH-2013, 709-713.