We present our work on constructing multi-scale deep convolutional neural networks for automatic speech recognition. We train several VGG nets that differ solely in the kernel size of their convolutional layers. The general idea is that receptive fields of varying sizes match structures of different scales, thus supporting more robust recognition when combined appropriately. We construct a large multi-scale system by means of system combination. We use ROVER and the fusion of posterior predictions as examples of late combination, and knowledge distillation using soft labels from a model ensemble as a form of early combination. In this work, distillation is approached from the perspective of knowledge transfer pre-training, followed by fine-tuning on the original hard labels. Our results show that it is possible to bundle the individual recognition strengths of the VGGs in a much simpler CNN architecture that matches the performance of the best late combination.
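To make the two combination strategies concrete, below is a minimal PyTorch sketch, not taken from the paper: the function names, tensor shapes, and loss formulations are illustrative assumptions. It shows posterior fusion as late combination, and a soft-label loss for knowledge transfer pre-training followed by a hard-label loss for fine-tuning as early combination.

```python
import torch
import torch.nn.functional as F

def fuse_posteriors(posterior_list, weights=None):
    # Late combination: (weighted) average of frame-level posterior
    # distributions produced by several multi-scale acoustic models.
    # posterior_list: list of (frames, states) tensors, each row summing to 1.
    stacked = torch.stack(posterior_list)            # (models, frames, states)
    if weights is None:
        weights = torch.full((stacked.size(0),), 1.0 / stacked.size(0))
    return (weights.view(-1, 1, 1) * stacked).sum(dim=0)   # (frames, states)

def soft_label_loss(student_logits, soft_labels):
    # Early combination: knowledge transfer pre-training minimizes the
    # KL divergence between the student's predictions and the soft
    # labels obtained from the fused ensemble posteriors.
    log_q = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(log_q, soft_labels, reduction="batchmean")

def fine_tune_loss(student_logits, hard_labels):
    # Second stage: fine-tuning on the original hard labels
    # with the standard cross-entropy objective.
    return F.cross_entropy(student_logits, hard_labels)
```

A typical schedule under these assumptions would first optimize `soft_label_loss` against `fuse_posteriors(...)` outputs, then switch to `fine_tune_loss` on the original alignments.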
Cite as: Heck, M., Suzuki, M., Fukuda, T., Kurata, G., Nakamura, S. (2017) Ensembles of Multi-Scale VGG Acoustic Models. Proc. Interspeech 2017, 1616-1620, doi: 10.21437/Interspeech.2017-920
@inproceedings{heck17_interspeech,
  author={Michael Heck and Masayuki Suzuki and Takashi Fukuda and Gakuto Kurata and Satoshi Nakamura},
  title={{Ensembles of Multi-Scale VGG Acoustic Models}},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1616--1620},
  doi={10.21437/Interspeech.2017-920}
}