Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition

Yevgen Chebotar, Austin Waters


Speech recognition systems that combine multiple types of acoustic models have been shown to outperform single-model systems. However, such systems can be complex to implement and too resource-intensive to use in production. This paper describes how to use knowledge distillation to combine acoustic models in a way that has the best of many worlds: It improves recognition accuracy significantly, can be implemented with standard training tools, and requires no additional complexity during recognition. First, we identify a simple but particularly strong type of ensemble: a late combination of recurrent neural networks with different architectures and training objectives. To harness such an ensemble, we use a variant of standard cross-entropy training to distill it into a single model and then discriminatively fine-tune the result. An evaluation on 2,000-hour large vocabulary tasks in 5 languages shows that the distilled models provide up to 8.9% relative WER improvement over conventionally-trained baselines with an identical number of parameters.
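The distillation step summarized in the abstract amounts to training a single student network with cross-entropy against the ensemble's combined frame-level state posteriors instead of the usual one-hot alignments. The NumPy sketch below illustrates that kind of objective; the function names, the uniform averaging of member posteriors, and the toy dimensions are illustrative assumptions, not the paper's exact combination or training scheme.

import numpy as np

def ensemble_soft_targets(member_posteriors):
    # Late combination (assumed uniform averaging): each element is an
    # array of frame-level state posteriors with shape (T, S),
    # T frames by S context-dependent states.
    return np.mean(np.stack(member_posteriors, axis=0), axis=0)

def distillation_loss(student_log_probs, teacher_posteriors):
    # Cross-entropy of the student's log-probabilities against the
    # ensemble's soft targets, averaged over frames.
    return -np.mean(np.sum(teacher_posteriors * student_log_probs, axis=-1))

# Toy usage: two "teacher" posterior streams over 10 frames and 4 states.
rng = np.random.default_rng(0)
teachers = [rng.dirichlet(np.ones(4), size=10) for _ in range(2)]
soft_targets = ensemble_soft_targets(teachers)
student_logits = rng.normal(size=(10, 4))
student_log_probs = student_logits - np.log(
    np.sum(np.exp(student_logits), axis=-1, keepdims=True))
loss = distillation_loss(student_log_probs, soft_targets)

In the paper's pipeline, this soft-target cross-entropy stage is followed by discriminative fine-tuning of the distilled model; the sketch covers only the distillation objective itself.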


DOI: 10.21437/Interspeech.2016-1190

Cite as

Chebotar, Y., Waters, A. (2016) Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition. Proc. Interspeech 2016, 3439-3443.

BibTeX
@inproceedings{Chebotar+2016,
  author={Yevgen Chebotar and Austin Waters},
  title={Distilling Knowledge from Ensembles of Neural Networks for Speech Recognition},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1190},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1190},
  pages={3439--3443}
}