16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Time-Frequency Kernel-Based CNN for Speech Recognition

Tuo Zhao (1), Yunxin Zhao (1), Xin Chen (2)

(1) University of Missouri, USA
(2) Pearson Knowledge Technologies, USA

We propose a novel approach to building time-frequency kernel-based deep convolutional neural networks (CNNs) for robust speech recognition. In the 2D convolution, we treat shifts along the time axis and shifts along the frequency axis of the speech feature representation differently, so as to achieve a degree of invariance to small frequency shifts while expanding the time context of the speech input without smearing the time positions of phone segments. The 2D-kernel approach allows easy implementation of deep CNNs. We present experimental results on the speaker-independent phone recognition tasks of TIMIT and FFMTIMIT; the latter was recorded with a far-field microphone and its speech data are noisy. Our results demonstrate that the proposed time-frequency kernel-based CNN gives consistent phone error reductions over the frequency-domain CNN and the DNN on both TIMIT and FFMTIMIT, with larger benefits when noisy speech is recognized using models trained on clean speech.
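The idea of treating the two axes differently can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the architecture, so the choice of max pooling along frequency only, the helper names conv2d_valid and pool_freq_only, and all sizes (40 bands, 15 frames, an 8x3 kernel, pooling size 3) are assumptions made purely for illustration.

```python
import numpy as np

def conv2d_valid(x, k):
    """Valid 2D cross-correlation of a (freq, time) map x with kernel k."""
    n_freq, n_time = x.shape
    k_freq, k_time = k.shape
    out = np.empty((n_freq - k_freq + 1, n_time - k_time + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k_freq, j:j + k_time] * k)
    return out

def pool_freq_only(y, size):
    """Non-overlapping max pooling along the frequency axis only.

    The time axis is left untouched, so time positions are not smeared
    (an assumed design, chosen to mirror the abstract's description).
    """
    n = y.shape[0] // size
    return y[:n * size].reshape(n, size, y.shape[1]).max(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((40, 15))  # hypothetical: 40 filterbank bands x 15 frames
k = rng.standard_normal((8, 3))    # hypothetical kernel: 8 bands x 3 frames
y = conv2d_valid(x, k)             # feature map over frequency and time
z = pool_freq_only(y, 3)           # pooled in frequency, full time resolution kept
print(y.shape, z.shape)
```

In this sketch the kernel spans several frames, so each output unit sees a wider time context, while pooling is applied only across neighbouring frequency positions to tolerate small frequency shifts.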

Bibliographic reference. Zhao, Tuo / Zhao, Yunxin / Chen, Xin (2015): "Time-frequency kernel-based CNN for speech recognition", in INTERSPEECH-2015, 1888-1892.