Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition

Jian Tang, Yan Song, Lirong Dai, Ian McLoughlin


Motivated by recent advances in computer vision, this paper proposes a novel acoustic model, the Densely Connected Residual Network (DenseRNet), for multichannel speech recognition. DenseRNet combines the strengths of both DenseNet and ResNet: it adopts the basic "building blocks" of ResNet, with different convolutional layers, receptive field sizes and growth rates, as components that are densely connected to form so-called denseR blocks. By concatenating the feature maps of all preceding layers as input to each layer, DenseRNet not only strengthens gradient back-propagation to alleviate the vanishing-gradient problem, but also exploits multi-resolution feature maps. Preliminary experiments on CHiME-3 show that DenseRNet trained with the cross-entropy criterion achieves a word error rate (WER) of 7.58% on beamforming-enhanced speech from the six-channel real test data, compared with 10.23% for the official baseline. Additional experiments demonstrate that DenseRNet is robust to beamforming-enhanced speech as well as to near- and far-field speech.
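The dense connectivity the abstract describes (each layer consumes the concatenation of all preceding feature maps, and each layer contributes `growth_rate` new maps) can be sketched in plain Python. This is an illustrative sketch of the connectivity pattern only, not the authors' model; the names `input_channels`, `dense_block_channels`, and `run_dense_block` are assumptions for illustration:

```python
def input_channels(initial_channels, growth_rate, layer_index):
    """Channels seen by layer `layer_index` (0-based) inside a dense block:
    the block input plus one growth_rate-sized map per preceding layer."""
    return initial_channels + layer_index * growth_rate


def dense_block_channels(initial_channels, growth_rate, num_layers):
    """Channels at the block output: the block input concatenated with
    every layer's output."""
    return initial_channels + num_layers * growth_rate


def run_dense_block(x, layers):
    """Run a dense block on a flat feature vector `x` (a list of channel
    values). Each element of `layers` is a callable mapping the
    concatenation of all preceding feature maps to a new feature map."""
    features = [x]
    for layer in layers:
        # Dense connection: concatenate every preceding feature map.
        concatenated = [c for fmap in features for c in fmap]
        features.append(layer(concatenated))
    # The block output is the concatenation of input and all layer outputs.
    return [c for fmap in features for c in fmap]
```

For example, a block with 4 input channels, growth rate 2, and two layers yields an output of `4 + 2 * 2 = 8` channels, matching `dense_block_channels(4, 2, 2)`; in the full model, transition layers would typically compress this growing channel count between blocks.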


DOI: 10.21437/Interspeech.2018-1089

Cite as: Tang, J., Song, Y., Dai, L., McLoughlin, I. (2018) Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition. Proc. Interspeech 2018, 1783-1787, DOI: 10.21437/Interspeech.2018-1089.


@inproceedings{Tang2018,
  author={Jian Tang and Yan Song and Lirong Dai and Ian McLoughlin},
  title={Acoustic Modeling with Densely Connected Residual Network for Multichannel Speech Recognition},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1783--1787},
  doi={10.21437/Interspeech.2018-1089},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1089}
}