Processing noisy signals using the ideal binary mask has been shown to improve automatic speech recognition (ASR) performance. In this paper, we present the first study that investigates the role of mask patterns in ASR under varying signal-to-noise ratios (SNRs), noise conditions and mask definitions. Binary masks are typically computed either by comparing the local SNR within a time-frequency unit of a mixture signal with a threshold termed the local criterion (LC), or by comparing the local target energy with the long-term average energy of speech. Results show that: (i) As in human speech recognition, binary masking can significantly improve ASR even when the mixture SNR is as low as -60 dB. (ii) The difference between the LC and the mixture SNR correlates more strongly with recognition accuracy than the LC itself. (iii) The performance profiles in ASR are qualitatively similar to those obtained for human speech recognition. (iv) The LC at which peak performance is obtained is lower than 0 dB, the threshold that is optimal with respect to the SNR gain of the processed signal. This indicates that maximizing SNR gain may not be the optimal criterion for improving either human or machine recognition of noisy speech.
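The two mask definitions mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the representation of time-frequency energies as nested lists indexed [frame][channel], the strict "greater than" comparison, and the choice to average target energy per frequency channel over the frames of one utterance are all assumptions made for illustration.

```python
import math


def ratio_db(numerator, denominator):
    """Energy ratio in dB, with explicit handling of zero energies."""
    if numerator == 0.0:
        return -math.inf
    if denominator == 0.0:
        return math.inf
    return 10.0 * math.log10(numerator / denominator)


def snr_based_mask(target, noise, lc_db=0.0):
    """Label a unit 1 when its local SNR exceeds the local criterion (LC).

    target and noise hold premixed energies indexed [frame][channel];
    access to both is what makes the mask "ideal".
    """
    return [
        [1 if ratio_db(t, n) > lc_db else 0 for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target, noise)
    ]


def target_based_mask(target, lc_db=0.0):
    """Label a unit 1 when its target energy exceeds the long-term average.

    Assumption: the long-term average is taken per frequency channel
    over all frames of the given utterance.
    """
    n_frames = len(target)
    n_channels = len(target[0])
    average = [
        sum(frame[c] for frame in target) / n_frames for c in range(n_channels)
    ]
    return [
        [1 if ratio_db(e, average[c]) > lc_db else 0 for c, e in enumerate(frame)]
        for frame in target
    ]


def mixture_snr_db(target, noise):
    """Overall SNR of the mixture: total target energy over total noise energy."""
    total_target = sum(sum(row) for row in target)
    total_noise = sum(sum(row) for row in noise)
    return ratio_db(total_target, total_noise)


def lc_relative_to_mixture_snr(lc_db, target, noise):
    """Difference between the LC and the mixture SNR, the quantity in finding (ii)."""
    return lc_db - mixture_snr_db(target, noise)


# Toy energies: 2 frames x 2 channels (made-up numbers, not data from the paper).
toy_target = [[4.0, 1.0], [1.0, 9.0]]
toy_noise = [[1.0, 4.0], [1.0, 1.0]]
mask_at_0_db = snr_based_mask(toy_target, toy_noise, lc_db=0.0)
mask_at_minus_7_db = snr_based_mask(toy_target, toy_noise, lc_db=-7.0)
```

In this toy example, lowering the LC from 0 dB to -7 dB retains every unit, which shows how the LC controls the mask pattern independently of the mixture itself.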
Index Terms: computational auditory scene analysis, ideal binary mask, automatic speech recognition, mask pattern
Bibliographic reference. Narayanan, Arun / Wang, DeLiang (2012): "On the role of binary mask pattern in automatic speech recognition", In INTERSPEECH-2012, 1239-1242.