In this paper, we describe our ZTSpeech system for the two tracks of the CHiME-5 challenge. For the front-end, we compare several popular beamforming methods and propose an omnidirectional minimum variance distortionless response (OMVDR) beamformer followed by weighted prediction error (WPE) dereverberation. Furthermore, we investigate the impact of data augmentation and data combination. For the back-end, we investigate several acoustic models (AMs) with different architectures in depth and evaluate both N-gram-based and recurrent neural network (RNN)-based language models (LMs). For the single-array track, combining the most effective approaches yields an absolute word error rate reduction of 11.94% on the evaluation set, from 73.27% to 61.33%. For the multiple-array track, our final system achieves an absolute reduction of 12.29% on the evaluation set, from 73.30% to 61.01%.
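The reported gains are differences between two percentages, so they are absolute rather than relative reductions. The following minimal sketch, which takes the reported figures as word error rates (the CHiME-5 evaluation metric), recomputes both quantities from the numbers in the abstract. The helper name `wer_reduction` is hypothetical and used only for illustration.

```python
def wer_reduction(baseline: float, system: float) -> tuple[float, float]:
    """Return (absolute, relative) reduction, both expressed in percent."""
    absolute = baseline - system              # difference in percentage points
    relative = 100.0 * absolute / baseline    # share of the baseline error removed
    return absolute, relative


# Evaluation-set figures quoted in the abstract: (baseline, final system).
results = {
    "single-array": wer_reduction(73.27, 61.33),
    "multiple-array": wer_reduction(73.30, 61.01),
}

for track, (abs_red, rel_red) in results.items():
    print(f"{track}: {abs_red:.2f} points absolute, {rel_red:.1f}% relative")
```

The absolute values reproduce the 11.94% and 12.29% quoted above. The relative reductions are derived here for context only and are not reported in the abstract.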
Cite as: Li, C., Wang, T. (2018) The ZTSpeech system for CHiME-5 Challenge: A far-field speech recognition system with front-end and robust back-end. Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), 58-63, doi: 10.21437/CHiME.2018-13
@inproceedings{li18_chime, author={Chenxing Li and Tieqiang Wang}, title={{The ZTSpeech system for CHiME-5 Challenge: A far-field speech recognition system with front-end and robust back-end}}, year=2018, booktitle={Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018)}, pages={58--63}, doi={10.21437/CHiME.2018-13} }