Knowledge Distillation for Sequence Model

Mingkun Huang, Yongbin You, Zhehuai Chen, Yanmin Qian, Kai Yu

Knowledge distillation, or teacher-student training, has been effectively used to improve the performance of a relatively simple deep learning model (the student) using a more complex model (the teacher). It is usually done by minimizing the Kullback-Leibler divergence (KLD) between the output distributions of the student and the teacher at each frame. However, the gain from frame-level knowledge distillation is limited for sequence models such as Connectionist Temporal Classification (CTC), due to the mismatch between the sequence-level criterion used in teacher model training and the frame-level criterion used in distillation. In this paper, sequence-level knowledge distillation is proposed to achieve better distillation performance. Instead of calculating a teacher posterior distribution given the feature vector of the current frame, a sequence training criterion is employed to calculate the posterior distribution given the whole utterance and the teacher model. Experiments are conducted on both the English Switchboard corpus and a large Chinese corpus. The proposed approach achieves significant and consistent improvements over traditional frame-level knowledge distillation using both labeled and unlabeled data.
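
As an illustration of the frame-level baseline described in the abstract, the following is a minimal NumPy sketch of a per-frame KLD loss between teacher and student output distributions. It is not the authors' implementation. The function names, the (T, K) input shape (T frames, K output labels), the direction KL(teacher || student), and the averaging over frames are all assumptions made for this example. The sequence-level criterion proposed in the paper is not shown.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (label) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def frame_level_kld(teacher_logits, student_logits):
    """Mean over frames of KL(teacher || student).

    Both inputs are hypothetical arrays of shape (T, K):
    T frames, K output labels. The KL direction and the
    frame averaging are assumptions of this sketch.
    """
    log_p = log_softmax(teacher_logits)   # teacher log-posteriors per frame
    log_q = log_softmax(student_logits)   # student log-posteriors per frame
    per_frame = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return per_frame.mean()

# Toy usage with random logits: 5 frames, 4 labels.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 4))
student = rng.normal(size=(5, 4))
loss_same = frame_level_kld(teacher, teacher)  # identical distributions
loss_diff = frame_level_kld(teacher, student)  # mismatched distributions
```

The loss is zero when the student reproduces the teacher's distribution at every frame and positive otherwise. Each frame is treated independently, which is the frame-level assumption the abstract contrasts with sequence-level training criteria such as CTC.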

DOI: 10.21437/Interspeech.2018-1589

Cite as: Huang, M., You, Y., Chen, Z., Qian, Y., Yu, K. (2018) Knowledge Distillation for Sequence Model. Proc. Interspeech 2018, 3703-3707, DOI: 10.21437/Interspeech.2018-1589.

  author={Mingkun Huang and Yongbin You and Zhehuai Chen and Yanmin Qian and Kai Yu},
  title={Knowledge Distillation for Sequence Model},
  booktitle={Proc. Interspeech 2018},