Semi-Supervised Training in Deep Learning Acoustic Model

Yan Huang, Yongqiang Wang, Yifan Gong


We studied semi-supervised training of a fully connected deep neural network (DNN), an unfolded recurrent neural network (RNN), and a long short-term memory recurrent neural network (LSTM-RNN) with respect to transcription quality, importance-based data sampling, and training data amount. We found that the DNN, unfolded RNN, and LSTM-RNN show increasing sensitivity to labeling errors, in that order. A one-point relative WER increase in the training transcriptions translates to roughly a half-point WER increase for the DNN and slightly more for the unfolded RNN, but to a full point WER increase for the LSTM-RNN; the LSTM-RNN is thus notably more sensitive to transcription errors. We further found that importance sampling has a similar impact on all three models: in supervised training it yields a 2-3% relative WER reduction over random sampling, while the gain shrinks in semi-supervised training. Lastly, we compared model capacity as the amount of training data grows. Experimental results suggest that the LSTM-RNN benefits more from enlarged training data than the unfolded RNN and DNN.
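The importance sampling compared against random sampling above can be illustrated with a minimal sketch: training utterances are drawn with probability proportional to a per-utterance importance weight. The weights here are hypothetical placeholders; the abstract does not specify the paper's actual importance criterion.

```python
import random

def sample_utterances(utterances, weights, k, seed=0):
    """Draw k training utterances with probability proportional to
    their importance weights (hypothetical scores; the exact
    criterion is not given in the abstract)."""
    rng = random.Random(seed)
    return rng.choices(utterances, weights=weights, k=k)

# Toy example: a heavily weighted utterance is drawn more often
# than a lightly weighted one.
utts = ["utt1", "utt2", "utt3", "utt4"]
wts = [0.1, 0.1, 0.1, 0.7]  # hypothetical importance scores
batch = sample_utterances(utts, wts, k=1000)
print(batch.count("utt4") > batch.count("utt1"))
```

Random sampling is the special case of uniform weights; the abstract's reported 2-3% relative WER gain comes from replacing those uniform weights with importance scores.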

We trained a semi-supervised LSTM-RNN with 2600 hours of transcribed and 10000 hours of untranscribed data on a mobile speech task. The semi-supervised LSTM-RNN yields a 6.56% relative WER reduction over the supervised baseline trained on the 2600 hours of transcribed speech.
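The semi-supervised setup above follows the usual self-training pattern: a seed model trained on the transcribed data decodes the untranscribed data, and sufficiently confident automatic transcriptions are added as training targets. A minimal sketch, with the decoder mocked out; the confidence threshold and selection mechanism here are illustrative assumptions, not details from the paper.

```python
def build_semi_supervised_set(transcribed, untranscribed, decode,
                              conf_threshold=0.8):
    """Combine human transcriptions with confident automatic ones.
    `decode` maps audio -> (hypothesis, confidence); the threshold
    value is illustrative, not taken from the paper."""
    data = list(transcribed)  # (audio, transcript) pairs
    for audio in untranscribed:
        hyp, conf = decode(audio)
        if conf >= conf_threshold:
            data.append((audio, hyp))
    return data

# Mock decoder standing in for seed-model decoding with confidences.
def mock_decode(audio):
    return f"hyp for {audio}", 0.9 if "clean" in audio else 0.5

supervised = [("a1", "hello world")]
unsupervised = ["clean_a2", "noisy_a3"]
train_set = build_semi_supervised_set(supervised, unsupervised, mock_decode)
print(len(train_set))  # 2: the human-labelled pair plus the confident hypothesis
```

The combined set (here 2600 h transcribed plus the retained portion of 10000 h untranscribed) is then used to retrain the acoustic model.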


DOI: 10.21437/Interspeech.2016-1596

Cite as

Huang, Y., Wang, Y., Gong, Y. (2016) Semi-Supervised Training in Deep Learning Acoustic Model. Proc. Interspeech 2016, 3848-3852.

Bibtex
@inproceedings{Huang+2016,
  author={Yan Huang and Yongqiang Wang and Yifan Gong},
  title={Semi-Supervised Training in Deep Learning Acoustic Model},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1596},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1596},
  pages={3848--3852}
}