Distilling Knowledge from an Ensemble of Models for Punctuation Prediction

Jiangyan Yi, Jianhua Tao, Zhengqi Wen, Ya Li


This paper proposes an approach for distilling knowledge from an ensemble of models into a single deep neural network (DNN) student model for punctuation prediction. The approach trains the DNN student model to mimic the behavior of the ensemble, which consists of three single models. The student is trained by minimizing the Kullback-Leibler (KL) divergence between its output distribution and the output of the ensemble. Experimental results on the English IWSLT2011 dataset show that the ensemble outperforms the previous state-of-the-art model by up to 4.0% absolute in overall F1-score. The DNN student model also achieves up to 13.4% absolute improvement in overall F1-score over the conventionally trained baseline models.
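The distillation objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the three models' outputs are combined, so the uniform averaging in `ensemble_average`, the four-label set, and all numeric values below are assumptions made purely for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_average(teacher_probs):
    """Average the teachers' posteriors label by label.

    Assumption: uniform averaging; the abstract does not specify
    how the ensemble output is formed.
    """
    n = len(teacher_probs)
    return [sum(column) / n for column in zip(*teacher_probs)]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical posteriors from three teacher models for one word position,
# over an assumed label set: (no punctuation, comma, period, question mark).
teachers = [
    [0.70, 0.20, 0.08, 0.02],
    [0.60, 0.25, 0.10, 0.05],
    [0.80, 0.10, 0.08, 0.02],
]

soft_target = ensemble_average(teachers)   # what the student should mimic
student = softmax([2.0, 0.5, 0.1, -1.0])   # hypothetical student logits
loss = kl_divergence(soft_target, student)  # quantity to minimize in training

print("soft target:", [round(p, 3) for p in soft_target])
print("distillation loss:", loss)
```

In actual training, this loss would be summed over all word positions and minimized with respect to the student's parameters by gradient descent; the sketch only evaluates it for a single position.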


DOI: 10.21437/Interspeech.2017-1079

Cite as: Yi, J., Tao, J., Wen, Z., Li, Y. (2017) Distilling Knowledge from an Ensemble of Models for Punctuation Prediction. Proc. Interspeech 2017, 2779-2783, DOI: 10.21437/Interspeech.2017-1079.


@inproceedings{Yi2017,
  author={Jiangyan Yi and Jianhua Tao and Zhengqi Wen and Ya Li},
  title={Distilling Knowledge from an Ensemble of Models for Punctuation Prediction},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2779--2783},
  doi={10.21437/Interspeech.2017-1079},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1079}
}