In this paper we investigate how much feature extraction is required by a deep neural network (DNN) based acoustic model for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), FFT magnitude spectrum and even on the raw time signal. The focus is put on raw time signal as input features, i.e. as much as zero feature extraction prior to DNN training. Noteworthy, the gap in recognition accuracy between MFCC and raw time signal decreases strongly once we switch from sigmoid activation function to rectified linear units, offering a real alternative to standard signal processing. The analysis of the first layer weights reveals that the DNN can discover multiple band pass filters in time domain. Therefore we try to improve the raw time signal based system by initializing the first hidden layer weights with impulse responses of an audiologically motivated filter bank. Inspired by the multi-resolutional analysis layer learned automatically from raw time signal input, we train the DNN on a combination of multiple short-term features. This illustrates how the DNN can learn from the little differences between MFCC, PLP and Gammatone features, suggesting that it is useful to present the DNN with different views on the underlying audio.
Bibliographic reference. Tüske, Zoltán / Golik, Pavel / Schlüter, Ralf / Ney, Hermann (2014): "Acoustic modeling with deep neural networks using raw time signal for LVCSR", In INTERSPEECH-2014, 890-894.