Traditional automatic speech recognition (ASR) systems usually suffer a sharp performance drop when noise is present in the speech. To build a robust ASR system, we introduce a new model that uses multi-task learning deep neural networks (MTL-DNN) to denoise speech at the feature level. In this model, the network is initialized by pre-training restricted Boltzmann machines (RBMs) and fine-tuned by jointly learning multiple interactive tasks through a shared representation. For multi-task learning, we choose a noisy-clean speech pair fitting task as the primary task and separately explore two constraints as secondary tasks: phone labels and phone clusters. In the experiments, the denoised speech is reconstructed by the MTL-DNN from the noisy speech input and evaluated with both DNN-hidden Markov model (HMM) based and Gaussian mixture model (GMM)-HMM based ASR systems. Results show that, using the denoised speech, the word error rate (WER) is reduced by 53.14% and 34.84%, respectively, compared with the baselines. The MTL-DNN model also outperforms the conventional single-task learning deep neural network (STL-DNN) model, with relative improvements of 4.93% and 3.88%, respectively.
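To illustrate the architecture described above, the sketch below shows a shared stack of hidden layers feeding two output heads: a regression head that fits clean features from noisy input (the primary task) and a classification head for the phone-label secondary task, trained with a joint loss. This is a minimal illustration, not the authors' implementation: the PyTorch framework, layer sizes, feature dimensions, and loss weight are assumptions, and the RBM pre-training stage is omitted.

```python
# Minimal sketch of a multi-task denoising DNN (illustrative, not the paper's exact setup).
import torch
import torch.nn as nn

class MTLDenoiser(nn.Module):
    def __init__(self, feat_dim=40, context=11, hidden=1024, n_phones=40):
        super().__init__()
        in_dim = feat_dim * context                     # spliced noisy-feature input
        self.shared = nn.Sequential(                    # shared representation
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.denoise_head = nn.Linear(hidden, feat_dim)  # primary: noisy-clean feature fitting
        self.phone_head = nn.Linear(hidden, n_phones)    # secondary: phone-label constraint

    def forward(self, noisy):
        h = self.shared(noisy)
        return self.denoise_head(h), self.phone_head(h)

def mtl_loss(clean_pred, phone_logits, clean_target, phone_target, alpha=0.1):
    # Joint objective: MSE for the denoising task plus a weighted
    # cross-entropy term from the phone-label secondary task.
    mse = nn.functional.mse_loss(clean_pred, clean_target)
    ce = nn.functional.cross_entropy(phone_logits, phone_target)
    return mse + alpha * ce
```

At test time only the regression head would be used: the reconstructed (denoised) features from noisy input are passed to the GMM-HMM or DNN-HMM back end for recognition, as in the evaluation described above.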
Cite as: Huang, B., Ke, D., Zheng, H., Xu, B., Xu, Y., Su, K. (2015) Multi-task learning deep neural networks for speech feature denoising. Proc. Interspeech 2015, 2464-2468, doi: 10.21437/Interspeech.2015-532
@inproceedings{huang15e_interspeech,
  author={Bin Huang and Dengfeng Ke and Hao Zheng and Bo Xu and Yanyan Xu and Kaile Su},
  title={{Multi-task learning deep neural networks for speech feature denoising}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={2464--2468},
  doi={10.21437/Interspeech.2015-532}
}