Features inspired by the auditory system have previously been shown to improve automatic speech recognition (ASR). Similarly, Deep Neural Networks (DNNs) have been found to outperform classic approaches to ASR in many conditions. Since DNNs can, in principle, learn task-relevant features directly from a conventional filter bank output, we investigate whether the combination of auditory features and deep learning should be preferred over patterns learned by the network itself. Specifically, noise-robust Gabor features and Amplitude Modulation Filter-Bank (AMFB) features, which are highly invariant to reverberation, are used as input to a state-of-the-art ASR system with DNN acoustic modeling. On the Aurora-4 task, auditory processing outperforms both mel-frequency cepstral coefficients (MFCC) and filter bank (FBank) features in many acoustic conditions, yielding average relative improvements of up to 69% over MFCC and 21% over the commonly used DNN-FBank setup. This highlights the mutual benefit of auditory signal processing and recent advances in machine learning.
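To make the idea of spectro-temporal Gabor features concrete, the following is a minimal sketch, not the authors' published Gabor filter bank implementation: a log-mel spectrogram is convolved with 2-D Gabor filters (a complex sinusoid windowed by a Hann envelope) tuned to different spectral and temporal modulation frequencies. Filter sizes and modulation frequencies here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_k, omega_n, size_k=9, size_n=9):
    """Build a 2-D spectro-temporal Gabor filter: a complex sinusoid
    with spectral modulation omega_k and temporal modulation omega_n
    (radians per channel / per frame), windowed by a Hann envelope."""
    k = np.arange(size_k) - size_k // 2
    n = np.arange(size_n) - size_n // 2
    env = np.outer(np.hanning(size_k), np.hanning(size_n))
    carrier = np.exp(1j * (omega_k * k[:, None] + omega_n * n[None, :]))
    g = env * carrier
    # subtract a scaled envelope so the filter has zero mean (no DC response)
    return g - env * g.sum() / env.sum()

def gabor_features(log_mel, filters):
    """Filter a (channels x frames) log-mel spectrogram with each Gabor
    filter and stack the real parts as the feature representation."""
    return np.stack([convolve2d(log_mel, f, mode='same').real
                     for f in filters])

# toy input: 40 mel channels, 100 frames of noise stand in for a spectrogram
rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))
bank = [gabor_filter(wk, wn) for wk in (0.25, 0.5) for wn in (0.0, 0.25)]
feats = gabor_features(log_mel, bank)
print(feats.shape)  # (4, 40, 100)
```

In a full system, such per-filter outputs (possibly subsampled) would be concatenated per frame and fed to the DNN in place of, or alongside, FBank features.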
Bibliographic reference. Martinez, Angel Mario Castro / Moritz, Niko / Meyer, Bernd T. (2014): "Should deep neural nets have ears? The role of auditory features in deep learning approaches", in Proc. INTERSPEECH 2014, pp. 2435-2439.