In this paper we continue to investigate how deep neural network
(DNN) based acoustic models for automatic speech recognition can be
trained without hand-crafted feature extraction. Previously, we have
shown that a simple fully connected feedforward DNN performs surprisingly
well when trained directly on the raw time signal. The analysis of
the weights revealed that the DNN has learned a kind of short-time
time-frequency decomposition of the speech signal. In conventional
feature extraction pipelines, this decomposition is hand-crafted in the
form of a filter bank that is shared across neighboring analysis windows.
Following this idea, we show that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on the raw time signal can be substantially reduced by introducing 1D-convolutional layers. Thus, the DNN is forced to learn a short-time filter bank shared over a longer time span. This also allows us to interpret the weights of the second convolutional layer in the same way as the 2D patches learned on critical-band energies by typical convolutional neural networks.
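To make this architecture concrete, the following is a minimal sketch, not the authors' exact topology: the framework (PyTorch), layer sizes, kernel widths, window length, and the number of output states are illustrative assumptions. The first 1D-convolutional layer acts as a learned short-time filter bank on the raw waveform, and the second convolutional layer operates on the resulting band-by-time representation before fully connected layers produce state posteriors.

```python
import torch
import torch.nn as nn

class RawWaveformAcousticModel(nn.Module):
    """Hypothetical illustration: raw-waveform convolutional front-end + MLP classifier."""

    def __init__(self, num_filters=128, num_states=4501):
        super().__init__()
        # First 1D convolution: a learned short-time filter bank
        # (assumed 16 kHz input, ~25 ms kernel, ~10 ms stride).
        self.filterbank = nn.Conv1d(1, num_filters, kernel_size=400, stride=160)
        # Second convolution over the learned "band" x time representation,
        # loosely analogous to 2D patches learned on critical-band energies.
        self.conv2 = nn.Conv1d(num_filters, num_filters, kernel_size=5)
        # Fully connected layers producing state posteriors.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(num_filters * 11, 2048),  # 11 frames remain for a 2720-sample window
            nn.ReLU(),
            nn.Linear(2048, num_states),
        )

    def forward(self, waveform):
        # waveform: (batch, 1, samples), e.g. a 2720-sample context window of raw signal
        x = torch.relu(self.filterbank(waveform))  # (batch, num_filters, 15)
        x = torch.relu(self.conv2(x))              # (batch, num_filters, 11)
        return self.classifier(x)                  # (batch, num_states)


model = RawWaveformAcousticModel()
window = torch.randn(8, 1, 2720)   # batch of raw-signal context windows
posteriors = model(window)         # shape: (8, 4501)
```

Sharing the first layer's filters across all positions of the input window is what replaces the fixed, hand-crafted filter bank of a conventional front-end; the specific kernel and stride values above are placeholders chosen only to make the example runnable.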
The evaluation is performed on an English LVCSR task. Trained on the raw time signal, the convolutional layers reduce the WER on the test set from 25.5% to 23.4%, compared to an MFCC-based result of 22.1% obtained with fully connected layers.
Golik, Pavel / Tüske, Zoltán / Schlüter, Ralf / Ney, Hermann (2015): "Convolutional neural networks for acoustic modeling of raw time signal in LVCSR", In INTERSPEECH-2015, 26-30.