Compression of CTC-Trained Acoustic Models by Dynamic Frame-Wise Distillation or Segment-Wise N-Best Hypotheses Imitation

Haisong Ding, Kai Chen, Qiang Huo


Knowledge distillation (KD) has been widely used for model compression by training a simpler student model to imitate the outputs or intermediate representations of a more complex teacher model. The most commonly used KD technique is to minimize the Kullback-Leibler divergence between the output distributions of the teacher and student models. When it is applied to compressing CTC-trained acoustic models, an assumption is made that the teacher and student share the same frame-wise feature-transcription alignment, which is usually not true due to the topology difference between the teacher and student models. In this paper, by making more appropriate assumptions, we propose two KD methods, namely dynamic frame-wise distillation and segment-wise N-best hypotheses imitation. Experimental results on the Switchboard-I speech recognition task show that segment-wise N-best hypotheses imitation outperforms frame-level and other sequence-level distillation methods, and achieves a relative word error rate reduction of 5%–8% compared with models trained from scratch.
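The standard frame-level KD objective mentioned above can be sketched as follows. This is a minimal NumPy illustration of the generic technique (not the paper's proposed methods): it averages the per-frame KL divergence between the teacher's and student's output distributions, and it implicitly bakes in the shared-alignment assumption that the paper argues is usually violated. Shapes and function names are illustrative.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the label dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def frame_wise_kd_loss(teacher_logits, student_logits):
    """Mean over frames of KL(p_teacher || p_student).

    Both inputs have shape (T, V): T frames, V output labels
    (including the CTC blank). Comparing the two models frame by
    frame assumes they share the same frame-wise alignment.
    """
    p_t = softmax(teacher_logits)
    log_ratio = np.log(p_t) - np.log(softmax(student_logits))
    return float((p_t * log_ratio).sum(axis=-1).mean())
```

In practice this loss is usually combined with the student's own CTC loss via an interpolation weight; the sketch shows only the distillation term.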


 DOI: 10.21437/Interspeech.2019-2182

Cite as: Ding, H., Chen, K., Huo, Q. (2019) Compression of CTC-Trained Acoustic Models by Dynamic Frame-Wise Distillation or Segment-Wise N-Best Hypotheses Imitation. Proc. Interspeech 2019, 3218-3222, DOI: 10.21437/Interspeech.2019-2182.


@inproceedings{Ding2019,
  author={Haisong Ding and Kai Chen and Qiang Huo},
  title={{Compression of CTC-Trained Acoustic Models by Dynamic Frame-Wise Distillation or Segment-Wise N-Best Hypotheses Imitation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3218--3222},
  doi={10.21437/Interspeech.2019-2182},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2182}
}