We propose a novel approach to building deep convolutional neural networks (CNNs) with time-frequency kernels for robust speech recognition. In the 2D convolution, we treat shifts along the time and frequency axes of the speech feature representations differently, so as to achieve invariance to small frequency shifts while expanding the time context of the speech input without smearing the time positions of phone segments. The 2D-kernel approach also allows easy implementation of deep CNNs. We present experimental results on speaker-independent phone recognition tasks of TIMIT and FFMTIMIT, where the latter was recorded with a far-field microphone and its speech data are noisy. Our results demonstrate that the proposed time-frequency kernel-based CNN gives consistent phone error reductions over frequency-domain CNNs and DNNs on both TIMIT and FFMTIMIT, with larger gains when recognizing noisy speech using clean-speech models.
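The asymmetric treatment of the two axes can be illustrated with a minimal NumPy sketch (an illustrative reconstruction, not the authors' implementation): a 2D time-frequency kernel is convolved over a (time, frequency) feature map, after which max-pooling is applied along the frequency axis only, so small frequency shifts are absorbed while the time resolution of phone segments is preserved. All array sizes and the pooling factor below are hypothetical choices.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D cross-correlation of feature map x with kernel k.
    x has shape (time, freq); the kernel spans both axes."""
    T, F = x.shape
    kt, kf = k.shape
    out = np.empty((T - kt + 1, F - kf + 1))
    for t in range(out.shape[0]):
        for f in range(out.shape[1]):
            out[t, f] = np.sum(x[t:t + kt, f:f + kf] * k)
    return out

def freq_max_pool(x, pool=2):
    """Max-pool along the frequency axis only; the time axis is left
    untouched so phone-segment positions are not smeared."""
    T, F = x.shape
    F_out = F // pool
    return x[:, :F_out * pool].reshape(T, F_out, pool).max(axis=2)

# Toy example: 20 frames x 40 filterbank channels, one 5x5 kernel
# (sizes chosen only for illustration).
rng = np.random.default_rng(0)
feat = rng.standard_normal((20, 40))
kernel = rng.standard_normal((5, 5))

h = np.maximum(conv2d_valid(feat, kernel), 0.0)  # convolution + ReLU
p = freq_max_pool(h, pool=2)                     # pooling over frequency only
print(h.shape, p.shape)  # (16, 36) (16, 18): time dim preserved by pooling
```

Note that the pooled map halves the frequency dimension (36 to 18) but keeps all 16 time steps, which is the property the abstract highlights for avoiding smeared phone boundaries.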
Bibliographic reference. Zhao, Tuo / Zhao, Yunxin / Chen, Xin (2015): "Time-frequency kernel-based CNN for speech recognition", In INTERSPEECH-2015, 1888-1892.