We propose a novel data utilization strategy, called multi-channel-condition learning, which leverages complementary information captured in microphone array speech to jointly train dereverberation and acoustic deep neural network (DNN) models for robust distant speech recognition. Experimental results with a single automatic speech recognition (ASR) system on the REVERB2014 simulated evaluation data show that, in 1-channel testing, the baseline joint training scheme attains a word error rate (WER) of 7.47%, down from 8.72% with separate training. The proposed multi-channel-condition learning scheme has been evaluated on different channel data combinations and usages, revealing many interesting implications. Finally, by training on all 8-channel data and applying DNN-based language model rescoring, a state-of-the-art WER of 4.05% is achieved. We anticipate an even lower WER when combining additional top ASR systems.
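To make the joint training idea concrete, below is a minimal PyTorch-style sketch of coupling a dereverberation front-end DNN with an acoustic-model DNN so that a single cross-entropy loss updates both networks. The layer sizes, feature dimensions, and the plain feed-forward topology are illustrative assumptions, not the configuration used in the paper; the multi-channel-condition aspect is represented only by the comment that frames from all array channels are pooled into one training set.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only (not the paper's setup).
FEAT_DIM = 40      # e.g., log-mel filterbank coefficients per frame
CONTEXT = 11       # spliced context window, in frames
NUM_STATES = 3000  # senone targets for the acoustic model

class DereverbDNN(nn.Module):
    """Front-end DNN mapping reverberant feature windows to enhanced features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM * CONTEXT, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, FEAT_DIM),
        )
    def forward(self, x):
        return self.net(x)

class AcousticDNN(nn.Module):
    """Back-end DNN mapping enhanced features to senone posteriors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, NUM_STATES),
        )
    def forward(self, x):
        return self.net(x)

dereverb, acoustic = DereverbDNN(), AcousticDNN()
optimizer = torch.optim.SGD(
    list(dereverb.parameters()) + list(acoustic.parameters()), lr=1e-3
)
ce_loss = nn.CrossEntropyLoss()

def joint_train_step(reverb_feats, senone_labels):
    """One joint update: the acoustic-model loss back-propagates through
    the dereverberation front-end as well, so both DNNs are trained together."""
    optimizer.zero_grad()
    enhanced = dereverb(reverb_feats)   # (batch, FEAT_DIM) enhanced features
    logits = acoustic(enhanced)         # (batch, NUM_STATES) senone scores
    loss = ce_loss(logits, senone_labels)
    loss.backward()                     # gradients flow into both networks
    optimizer.step()
    return loss.item()

# Multi-channel-condition learning, sketched: pool frames from all array
# channels into the same training set so both models see diverse conditions.
batch = torch.randn(32, FEAT_DIM * CONTEXT)    # dummy reverberant frames
labels = torch.randint(0, NUM_STATES, (32,))   # dummy senone targets
print(joint_train_step(batch, labels))
```

In this sketch the key design choice is simply that the two networks share one optimizer and one loss, which is what distinguishes joint training from training the dereverberation and acoustic models separately.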
Cite as: Ge, F., Li, K., Wu, B., Siniscalchi, S.M., Yan, Y., Lee, C.-H. (2017) Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition. Proc. Interspeech 2017, 3847-3851, doi: 10.21437/Interspeech.2017-579
@inproceedings{ge17_interspeech,
  author    = {Fengpei Ge and Kehuang Li and Bo Wu and Sabato Marco Siniscalchi and Yonghong Yan and Chin-Hui Lee},
  title     = {{Joint Training of Multi-Channel-Condition Dereverberation and Acoustic Modeling of Microphone Array Speech for Robust Distant Speech Recognition}},
  year      = {2017},
  booktitle = {Proc. Interspeech 2017},
  pages     = {3847--3851},
  doi       = {10.21437/Interspeech.2017-579}
}