We introduce a novel spectro-temporal representation of speech, obtained by applying directional derivative filters to the Mel-spectrogram, with the aim of improving the noise robustness of automatic speech recognition. Previous studies have shown that two-dimensional wavelet functions, when tuned to appropriate spectral scales and temporal rates, accurately capture the acoustic modulations of speech, even in high-noise conditions. Spectro-temporal features extracted from the wavelet transform of the spectrogram therefore offer additional noise robustness for important signal processing tasks such as voice activity detection and speech recognition. In this paper, we explore the use of the steerable pyramid, a directional wavelet transform common in image processing, to derive a spectro-temporal feature representation of speech that can serve as an alternative to cepstral derivatives and Gabor filter-bank features, and we discuss its application to robust automatic speech recognition. Experiments conducted on the Aurora-2 database demonstrate robustness competitive with other state-of-the-art speech features, especially in low signal-to-noise-ratio conditions.
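The core idea, steering derivative filters over a time-frequency representation, can be sketched in a few lines. The sketch below is an illustrative assumption, not the authors' implementation: it uses simple finite differences (`np.gradient`) in place of a full steerable-pyramid decomposition, and a synthetic ridge in place of a real Mel-spectrogram.

```python
import numpy as np

def directional_derivatives(spec, angles):
    """Directional derivatives of a 2-D time-frequency representation.

    spec   : 2-D array (frequency bins x time frames), e.g. a log-Mel-spectrogram
    angles : iterable of orientations in radians (0 = purely temporal,
             pi/2 = purely spectral)

    The derivative at angle a is cos(a) * dS/dt + sin(a) * dS/df, i.e. a
    finite-difference stand-in for one steered band of a derivative filter bank.
    """
    df, dt = np.gradient(spec)  # partials along the frequency and time axes
    return [np.cos(a) * dt + np.sin(a) * df for a in angles]

# Toy "spectrogram": a Gaussian ridge sweeping upward in frequency over time
# (a crude stand-in for a formant glide; purely illustrative).
t = np.linspace(0.0, 1.0, 40)
f = np.linspace(0.0, 1.0, 32)
T, F = np.meshgrid(t, f)
spec = np.exp(-((F - 0.3 - 0.4 * T) ** 2) / 0.01)

# Three oriented feature maps: temporal, diagonal, and spectral derivatives.
feats = directional_derivatives(spec, [0.0, np.pi / 4, np.pi / 2])
```

In the paper's setting, each oriented response map would be derived from steerable-pyramid subbands at multiple scales and then reduced to a feature vector per frame; the sketch only shows the steering step itself.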
Bibliographic reference. Gibson, James / Van Segbroeck, Maarten / Ortega, Antonio / Georgiou, Panayiotis G. / Narayanan, Shrikanth (2013): "Spectro-temporal directional derivative features for automatic speech recognition", in INTERSPEECH-2013, 872-875.