Traditional automatic speech recognition (ASR) systems usually get a sharp performance drop when noise presents in speech. To make a robust ASR, we introduce a new model using the multi-task learning deep neural networks (MTL-DNN) to solve the speech denoising task in feature level. In this model, the networks are initialized by pre-training restricted Boltzmann machines (RBM) and fine-tuned by jointly learning multiple interactive tasks using a shared representation. In multi-task learning, we choose a noisy-clean speech pair fitting task as the primary task and separately explore two constraints as the secondary tasks: phone label and phone cluster. In experiments, the denoised speech is reconstructed by the MTL-DNN using the noisy speech as input and it is respectively evaluated by the DNN-hidden Markov model (HMM) based and the Gaussian Mixture Model (GMM)-HMM based ASR systems. Results show that, using the denoised speech, the word error rate (WER) is respectively reduced by 53.14% and 34.84% compared with baselines. The MTL-DNN model also outperforms the general single-task learning deep neural networks (STL-DNN) model with a performance improvement of 4.93% and 3.88% respectively.
Bibliographic reference. Huang, Bin / Ke, Dengfeng / Zheng, Hao / Xu, Bo / Xu, Yanyan / Su, Kaile (2015): "Multi-task learning deep neural networks for speech feature denoising", In INTERSPEECH-2015, 2464-2468.