Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention

Dong Yu, Wayne Xiong, Jasha Droppo, Andreas Stolcke, Guoli Ye, Jinyu Li, Geoffrey Zweig


In this paper, we propose a deep convolutional neural network (CNN) with layer-wise context expansion and location-based attention for large-vocabulary speech recognition. In our model, each higher layer uses information from broader contexts, along both the time and frequency dimensions, than its immediate lower layer. We show that both the layer-wise context expansion and the location-based attention can be implemented using the element-wise matrix product and the convolution operation. For this reason, contrary to other CNNs, no pooling operation is used in our model. Experiments on the 309-hour Switchboard task and the 375-hour short message dictation task indicate that our model significantly outperforms both DNN and LSTM baselines.
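The abstract's core claim — that context expansion and location-based attention reduce to convolutions and element-wise (Hadamard) products — can be illustrated with a minimal numpy sketch. This is not the paper's actual parameterization; the layer function, kernel, and attention mask below are illustrative assumptions only.

```python
import numpy as np

def context_attention_layer(x, kernel, attention):
    """Sketch of one layer: a 1-D convolution over time expands each
    output frame's context window, then a learned element-wise
    attention mask reweights the result. All names are hypothetical,
    not taken from the paper.
    x: (T, F) time-frequency input
    kernel: (K,) 1-D context window applied along the time axis
    attention: (T, F) element-wise attention weights
    """
    # Context expansion: each output frame aggregates K input frames
    expanded = np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x)
    # Location-based attention as an element-wise matrix product
    return expanded * attention

x = np.random.randn(10, 4)          # 10 frames, 4 frequency bins
kernel = np.ones(3) / 3.0           # simple 3-frame averaging window
attn = np.ones_like(x)              # uniform attention for the demo
y = context_attention_layer(x, kernel, attn)
print(y.shape)  # (10, 4)
```

Stacking such layers makes each layer's effective context grow with depth, which is why no pooling is needed to widen the receptive field.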


DOI: 10.21437/Interspeech.2016-251

Cite as

Yu, D., Xiong, W., Droppo, J., Stolcke, A., Ye, G., Li, J., Zweig, G. (2016) Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention. Proc. Interspeech 2016, 17-21.

Bibtex
@inproceedings{Yu+2016,
author={Dong Yu and Wayne Xiong and Jasha Droppo and Andreas Stolcke and Guoli Ye and Jinyu Li and Geoffrey Zweig},
title={Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-251},
url={http://dx.doi.org/10.21437/Interspeech.2016-251},
pages={17--21}
}