On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones

Yan-Hui Tu, Jun Du, Lei Sun, Feng Ma, Chin-Hui Lee


We design a novel deep learning framework for multi-channel speech recognition, with contributions in two areas. First, for the front-end, we present an iterative mask estimation (IME) approach based on deep learning that improves on beamforming based on the conventional complex Gaussian mixture model (CGMM). Second, for the back-end, deep convolutional neural networks (DCNNs), trained with augmentation of both noisy and beamformed data, are adopted for acoustic modeling, while forward and backward long short-term memory recurrent neural networks (LSTM-RNNs) are used for language modeling. The proposed framework is effective for multi-channel speech recognition with random combinations of fixed microphones. Tested on the CHiME-4 Challenge speech recognition task with a single set of acoustic and language models, our approach achieves the best performance among submitted systems in all three tracks (1-channel, 2-channel, and 6-channel).
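As a rough illustration of how time-frequency masks can drive a beamformer, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not state which beamformer is used, so an MVDR beamformer with a reference channel is assumed here; the function names (`spatial_covariance`, `mvdr_weights`, `beamform`), the array shapes, and the `diag_load` parameter are all hypothetical choices made for this example. The masks themselves are taken as given; the iterative estimation procedure (IME) is not reproduced.

```python
import numpy as np

def spatial_covariance(stft, mask):
    """Mask-weighted spatial covariance for each frequency bin.

    stft: complex array, shape (channels, frames, freqs)
    mask: real array in [0, 1], shape (frames, freqs)
    returns: complex array, shape (freqs, channels, channels)
    """
    # Sum over frames of mask(t, f) * y(t, f) y(t, f)^H.
    weighted = np.einsum("tf,ctf,dtf->fcd", mask, stft, stft.conj())
    # Normalise by the total mask weight per frequency bin.
    norm = np.maximum(mask.sum(axis=0), 1e-10)
    return weighted / norm[:, None, None]

def mvdr_weights(phi_speech, phi_noise, ref_channel=0, diag_load=1e-6):
    """MVDR weights (assumed beamformer), shape (freqs, channels)."""
    n_ch = phi_noise.shape[-1]
    # Diagonal loading keeps the noise covariance invertible.
    phi_noise = phi_noise + diag_load * np.eye(n_ch)
    # numerator = phi_noise^{-1} phi_speech, per frequency bin.
    numerator = np.linalg.solve(phi_noise, phi_speech)
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    # Select the column for the reference channel and normalise.
    return numerator[..., ref_channel] / (trace[:, None] + 1e-10)

def beamform(stft, weights):
    """Apply w(f)^H y(t, f); returns shape (frames, freqs)."""
    return np.einsum("fc,ctf->tf", weights.conj(), stft)
```

In this sketch a speech mask and a noise mask (for example, `1 - speech_mask`) would each be passed to `spatial_covariance`, and the two resulting matrices to `mvdr_weights`. Better masks give better covariance estimates, which is the point at which a learned mask estimator could replace or refine CGMM-derived masks.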


DOI: 10.21437/Interspeech.2017-853

Cite as: Tu, Y., Du, J., Sun, L., Ma, F., Lee, C. (2017) On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones. Proc. Interspeech 2017, 394-398, DOI: 10.21437/Interspeech.2017-853.


@inproceedings{Tu2017,
  author={Yan-Hui Tu and Jun Du and Lei Sun and Feng Ma and Chin-Hui Lee},
  title={On Design of Robust Deep Models for CHiME-4 Multi-Channel Speech Recognition with Multiple Configurations of Array Microphones},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={394--398},
  doi={10.21437/Interspeech.2017-853},
  url={http://dx.doi.org/10.21437/Interspeech.2017-853}
}