Convolutional Neural Networks (CNNs) have achieved state-of-the-art performance when embedded in large vocabulary continuous speech recognition (LVCSR) systems, owing to their ability to model local correlations and reduce translational variations. In previous related work on automatic speech recognition (ASR), at most two convolutional layers have been employed. In light of the recent success of very deep CNNs in image classification, it is of interest to investigate deep CNN structures for speech recognition in detail. In contrast to image classification, the dimensionality of the speech feature, the span size of the input feature, and the relationship between the temporal and spectral domains are new factors to consider when designing very deep CNNs. In this work, very deep CNNs are introduced for the LVCSR task by extending the depth of the convolutional layers up to ten. The contribution of this work is two-fold: the performance improvement of very deep CNNs is investigated under different configurations; further, a better way to perform convolution operations along the temporal dimension is proposed. Experiments showed that very deep CNNs offer an 8-12% relative improvement over a baseline DNN system, and a 4-7% relative improvement over a baseline CNN system, evaluated on both a 15-hr Callhome and a 51-hr Switchboard LVCSR task.
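To illustrate why the span size of the input feature becomes a design constraint as depth grows, the following minimal sketch traces the feature-map shape through a stack of unpadded convolutions. The input size (40 frequency bins by 11 frames), the kernel sizes, and the function names are illustrative assumptions, not the configuration used in this work.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one axis after a convolution step."""
    return (size + 2 * padding - kernel) // stride + 1


def stack_output_shape(freq_bins, frames, layers):
    """Trace the (frequency, time) feature-map shape through a layer stack.

    Each layer is a (kernel_freq, kernel_time) pair applied without padding.
    Raises ValueError if a layer no longer fits the remaining feature map.
    """
    shape = (freq_bins, frames)
    for depth, (k_freq, k_time) in enumerate(layers, start=1):
        new_freq = conv_output_size(shape[0], k_freq)
        new_time = conv_output_size(shape[1], k_time)
        if new_freq < 1 or new_time < 1:
            raise ValueError(f"layer {depth} does not fit shape {shape}")
        shape = (new_freq, new_time)
    return shape


# Hypothetical input: 40 frequency bins by 11 context frames.
shallow = [(3, 3)] * 2   # two layers, as in earlier shallow CNNs
deep = [(3, 3)] * 10     # ten layers with the same assumed kernel

print(stack_output_shape(40, 11, shallow))  # (36, 7)
try:
    print(stack_output_shape(40, 11, deep))
except ValueError as err:
    # The 11-frame window is used up along time before layer 10.
    print("too deep for this context window:", err)
```

Under these assumed sizes, a ten-layer stack of 3x3 kernels exhausts an 11-frame window along the time axis, whereas kernels of width 1 in time (for example `[(3, 1)] * 10`) leave the temporal dimension untouched. This is only meant to show the kind of trade-off involved, not the specific temporal convolution scheme proposed here.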
Bibliographic reference. Bi, Mengxiao / Qian, Yanmin / Yu, Kai (2015): "Very deep convolutional neural networks for LVCSR", In INTERSPEECH-2015, 3259-3263.