Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition

Fengpei Ge, Kehuang Li, Bo Wu, Sabato Marco Siniscalchi, Yonghong Yan, Chin-Hui Lee


We propose a novel data utilization strategy, called multi-channel-condition learning, which leverages the complementary information captured in microphone array speech to jointly train dereverberation and acoustic deep neural network (DNN) models for robust distant speech recognition. Experimental results with a single automatic speech recognition (ASR) system on the REVERB2014 simulated evaluation data show that, for 1-channel testing, the baseline joint training scheme attains a word error rate (WER) of 7.47%, reduced from 8.72% for separate training. The proposed multi-channel-condition learning scheme is evaluated on different channel-data combinations and usages, revealing many interesting implications. Finally, training on all 8-channel data and applying DNN-based language model rescoring achieves a state-of-the-art WER of 4.05%. We anticipate an even lower WER when combining more top ASR systems.
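The core idea of joint training is that the dereverberation front-end and the acoustic model form one differentiable pipeline, so the recognizer's loss gradient also updates the enhancement network. The following is a minimal illustrative sketch of that mechanism only, not the paper's architecture: both networks are reduced to single linear layers, the data and senone targets are synthetic, and all dimensions and learning-rate choices are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's configuration):
D, K, N = 20, 5, 200                 # feature dim, senone classes, frames
X = rng.normal(size=(N, D))          # synthetic "reverberant" input features
y = rng.integers(0, K, size=N)       # synthetic frame-level senone targets

# Dereverberation front-end: maps reverberant to "enhanced" features.
W1 = rng.normal(scale=0.1, size=(D, D))
# Acoustic model: linear layer followed by a softmax over senone classes.
W2 = rng.normal(scale=0.1, size=(D, K))

def forward(X):
    E = X @ W1                                   # enhanced features
    logits = E @ W2
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # softmax posteriors
    return E, P

def cross_entropy(P, y):
    return -np.log(P[np.arange(len(y)), y] + 1e-12).mean()

E, P = forward(X)
loss0 = cross_entropy(P, y)

lr = 0.5
for _ in range(100):
    E, P = forward(X)
    G = P.copy()
    G[np.arange(N), y] -= 1.0
    G /= N                                       # dLoss/dlogits
    gW2 = E.T @ G                                # acoustic-model gradient
    gE = G @ W2.T                                # gradient flowing into the front-end
    gW1 = X.T @ gE                               # joint step: ASR loss updates dereverberation
    W2 -= lr * gW2
    W1 -= lr * gW1

E, P = forward(X)
loss_final = cross_entropy(P, y)
```

The key line is `gW1 = X.T @ gE`: in separate training the front-end would be optimized against its own enhancement target, whereas here the recognition loss alone drives both sets of weights.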


DOI: 10.21437/Interspeech.2017-579

Cite as: Ge, F., Li, K., Wu, B., Siniscalchi, S.M., Yan, Y., Lee, C.-H. (2017) Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition. Proc. Interspeech 2017, 3847-3851, DOI: 10.21437/Interspeech.2017-579.


@inproceedings{Ge2017,
  author={Fengpei Ge and Kehuang Li and Bo Wu and Sabato Marco Siniscalchi and Yonghong Yan and Chin-Hui Lee},
  title={Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3847--3851},
  doi={10.21437/Interspeech.2017-579},
  url={http://dx.doi.org/10.21437/Interspeech.2017-579}
}