Multichannel Spatial Clustering for Robust Far-Field Automatic Speech Recognition in Mismatched Conditions

Michael I. Mandel, Jon Barker


Recent automatic speech recognition (ASR) systems perform well when the training data matches the test data, but much worse when the two differ in some important regard, such as the number and arrangement of microphones or the reverberation and noise conditions. This paper proposes an unsupervised spatial clustering approach to microphone array processing that can overcome such train-test mismatches. This approach, known as Model-based EM Source Separation and Localization (MESSL), clusters spectrogram points based on the relative differences in phase and level between pairs of microphones. Here it is used for the first time to drive minimum variance distortionless response (MVDR) beamforming in several ways. We compare it to a standard delay-and-sum beamformer on the CHiME-3 noisy test set (real recordings), using each system as a pre-processor for the same recognizer trained on the AMI meeting corpus. We find that the spatial clustering front end reduces word error rates by between 9.9% and 17.1% relative to the baseline.
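To make the two ingredients of the abstract concrete, here is a minimal NumPy sketch (not the authors' implementation): the pairwise phase and level difference features that MESSL-style clustering operates on, and the MVDR weight formula w = R⁻¹d / (dᴴR⁻¹d) for a single frequency bin. The steering vector, noise covariance, and delays below are toy values chosen for illustration.

```python
import numpy as np

def ipd_ild(stft_a, stft_b, eps=1e-12):
    """Per-time-frequency-point pairwise features for spatial clustering:
    interaural phase difference (IPD, radians) and interaural level
    difference (ILD, dB) between two channels' STFTs."""
    ipd = np.angle(stft_a * stft_b.conj())
    ild = 20 * np.log10((np.abs(stft_a) + eps) / (np.abs(stft_b) + eps))
    return ipd, ild

def mvdr_weights(noise_cov, steering):
    """MVDR weights w = R^-1 d / (d^H R^-1 d): minimize output noise
    power subject to a distortionless response in the look direction."""
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

rng = np.random.default_rng(0)

# Toy two-channel STFT pair: channel B is channel A scaled by 0.5,
# so IPD is zero and ILD is 20*log10(2) ~ 6 dB at every point.
Xa = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
ipd, ild = ipd_ild(Xa, 0.5 * Xa)

# Toy MVDR example: 4 mics, one bin at 1 kHz, free-field steering
# vector from hypothetical per-microphone delays (seconds).
n_mics = 4
delays = rng.uniform(0, 1e-3, n_mics)
d = np.exp(-2j * np.pi * 1000.0 * delays)
# Hypothetical noise covariance: identity plus a small Hermitian term.
A = rng.standard_normal((n_mics, n_mics)) + 1j * rng.standard_normal((n_mics, n_mics))
R = np.eye(n_mics) + 0.1 * (A @ A.conj().T) / n_mics
w = mvdr_weights(R, d)
# Distortionless constraint: unit gain toward the source, w^H d = 1.
```

In the paper's pipeline the time-frequency masks produced by MESSL's clustering would drive the estimate of the noise covariance R; the sketch above just fixes R by hand to show the weight computation.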


DOI: 10.21437/Interspeech.2016-1275

Cite as

Mandel, M.I., Barker, J. (2016) Multichannel Spatial Clustering for Robust Far-Field Automatic Speech Recognition in Mismatched Conditions. Proc. Interspeech 2016, 1991-1995.

Bibtex
@inproceedings{Mandel+2016,
author={Michael I. Mandel and Jon Barker},
title={Multichannel Spatial Clustering for Robust Far-Field Automatic Speech Recognition in Mismatched Conditions},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1275},
url={http://dx.doi.org/10.21437/Interspeech.2016-1275},
pages={1991--1995}
}