In this paper, we describe our systems and report our results for the CHiME-5 single-array track. We focus on front-end multi-channel speech processing, including beamforming and dereverberation. To address the complexity of the data and the recording scenario, we use multiple beamformers, each targeting a predefined direction. Each beamformed signal is decoded to produce an N-best list, and these N-best lists are then combined by ROVER to obtain the final result. Before beamforming, a multi-channel generalized weighted prediction error method is applied for dereverberation. For acoustic modeling, a CNN-TDNN-F model shows a significant improvement over the official baseline system. For language modeling, re-scoring with an LSTM-based language model yields a further improvement. Without system fusion, our single system achieves a 14.4% relative word error rate reduction on the development set over the baseline system.
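The idea of running several beamformers, each steered toward a fixed direction, can be illustrated with a minimal sketch. The abstract does not say which beamformer type is used, so the code below assumes a plain far-field delay-and-sum beamformer on a linear array. The function names, the array geometry, and the candidate angles are all hypothetical and chosen only for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # metres per second (assumed room-temperature value)


def arrival_delays(mic_positions, angle_deg):
    """Far-field arrival delay (seconds) at each microphone.

    Assumes a linear array along one axis; angle_deg is measured from
    that axis. Microphones closer to the source get negative delays.
    """
    return -np.asarray(mic_positions) * np.cos(np.deg2rad(angle_deg)) / SPEED_OF_SOUND


def delay_and_sum(signals, mic_positions, angle_deg, sample_rate):
    """Steer a delay-and-sum beamformer toward angle_deg.

    signals: array of shape (num_mics, num_samples).
    Returns the beamformed signal of shape (num_samples,).
    """
    num_samples = signals.shape[1]
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / sample_rate)
    spectra = np.fft.rfft(signals, axis=1)
    delays = arrival_delays(mic_positions, angle_deg)
    # Undo each channel's arrival delay with a linear phase shift,
    # then average the aligned channels.
    compensation = np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft((spectra * compensation).mean(axis=0), n=num_samples)


def beamform_directions(signals, mic_positions, angles_deg, sample_rate):
    """Run one beamformer per predefined direction.

    Returns a dict mapping each angle to its beamformed signal. In a
    pipeline like the one described, each output would be decoded
    separately and the resulting hypotheses combined afterwards.
    """
    return {
        float(angle): delay_and_sum(signals, mic_positions, angle, sample_rate)
        for angle in angles_deg
    }
```

Each entry of the returned dictionary corresponds to one beamformed stream. The decoding and ROVER combination stages are not shown, since they depend on the recognizer and toolkit in use.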
Cite as: Sun, S., Shi, Y., Yeh, C.-F., Bu, S., Hwang, M.-Y., Xie, L. (2018) Multiple beamformers with ROVER for the CHiME-5 Challenge. Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), 85-87, doi: 10.21437/CHiME.2018-19
@inproceedings{sun18_chime,
  author={Sining Sun and Yangyang Shi and Ching-Feng Yeh and Suliang Bu and Mei-Yuh Hwang and Lei Xie},
  title={{Multiple beamformers with ROVER for the CHiME-5 Challenge}},
  year=2018,
  booktitle={Proc. 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018)},
  pages={85--87},
  doi={10.21437/CHiME.2018-19}
}