16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Convolutional Neural Networks for Acoustic Modeling of Raw Time Signal in LVCSR

Pavel Golik, Zoltán Tüske, Ralf Schlüter, Hermann Ney

RWTH Aachen University, Germany

In this paper we continue to investigate how deep neural network (DNN) based acoustic models for automatic speech recognition can be trained without hand-crafted feature extraction. Previously, we have shown that a simple fully connected feedforward DNN performs surprisingly well when trained directly on the raw time signal. An analysis of the weights revealed that the DNN had learned a kind of short-time time-frequency decomposition of the speech signal. In conventional feature extraction pipelines this is done manually by means of a filter bank shared between neighboring analysis windows.
    Following this idea, we show that the performance gap between DNNs trained on spliced hand-crafted features and DNNs trained on the raw time signal can be substantially reduced by introducing 1D-convolutional layers. The DNN is thus forced to learn a short-time filter bank shared over a longer time span. This also allows us to interpret the weights of the second convolutional layer in the same way as the 2D patches that typical convolutional neural networks learn on critical band energies.
    The evaluation is performed on an English LVCSR task. Trained on the raw time signal, the convolutional layers reduce the WER on the test set from 25.5% to 23.4%, compared to an MFCC-based result of 22.1% using fully connected layers.
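The core idea of the first convolutional layer, a set of learned short-time filters applied with a fixed shift so that the same "filter bank" is shared across all analysis windows in a longer input span, can be illustrated with a minimal NumPy sketch. All sizes here (window length, hop, number of filters) are hypothetical placeholders, not the configuration used in the paper, and the filters are random rather than learned:

```python
import numpy as np

# Hypothetical sizes (assumptions, not the paper's actual setup):
# a short analysis window of raw samples, shifted by 10 ms at 16 kHz.
win_len = 272        # samples per short-time window
hop = 160            # 10 ms shift at 16 kHz
n_filters = 32       # size of the learned "filter bank"

rng = np.random.default_rng(0)
signal = rng.standard_normal(4000)                           # 250 ms of raw signal
filters = rng.standard_normal((n_filters, win_len)) * 0.01   # stand-in for learned 1D-conv weights

# Strided 1D convolution: every filter is applied to every analysis
# window, i.e. the filter bank is shared over the whole time span.
n_frames = (len(signal) - win_len) // hop + 1
frames = np.stack([signal[t * hop : t * hop + win_len] for t in range(n_frames)])
activations = frames @ filters.T                             # shape: (n_frames, n_filters)

print(activations.shape)  # one n_filters-dimensional vector per frame
```

In a trained network the rows of `filters` would play the role of the time-frequency decomposition that the paper observes emerging from the raw-signal DNN, and `activations` would be the frame-wise representation passed to the layers above.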


Bibliographic reference.  Golik, Pavel / Tüske, Zoltán / Schlüter, Ralf / Ney, Hermann (2015): "Convolutional neural networks for acoustic modeling of raw time signal in LVCSR", In INTERSPEECH-2015, 26-30.