INTERSPEECH 2004 - ICSLP
A time-varying Wiener filter extracts the speech signal from a noisy mixture using the a priori signal-to-noise ratio in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with that of a missing data recognizer, which operates in the spectral domain using the time-frequency units dominated by speech. For use by the missing data recognizer, the same processor is used to estimate an ideal time-frequency binary mask, which selects the speech signal if it is stronger than the interference in a local time-frequency unit. We find that the missing data recognizer performs better on a small vocabulary recognition task, but the conventional recognizer performs substantially better when the vocabulary size is larger.
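To make the two mask types concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the speech and interference energies in each time-frequency unit are known, whereas the paper estimates the masks with a binaural processor. The function names, the `eps` guard, and the toy energy values are illustrative assumptions. The ratio mask uses the standard Wiener gain, SNR / (1 + SNR), which equals S / (S + N); the binary mask keeps a unit only when speech energy exceeds interference energy.

```python
import numpy as np

def ratio_mask(speech_energy, noise_energy, eps=1e-12):
    """Wiener-style gain per time-frequency unit: S / (S + N).

    Equivalent to SNR / (1 + SNR) with SNR = S / N.
    eps (an assumption of this sketch) avoids division by zero.
    """
    s = np.asarray(speech_energy, dtype=float)
    n = np.asarray(noise_energy, dtype=float)
    return s / (s + n + eps)

def binary_mask(speech_energy, noise_energy):
    """1 where speech is stronger than interference, else 0."""
    s = np.asarray(speech_energy, dtype=float)
    n = np.asarray(noise_energy, dtype=float)
    return (s > n).astype(float)

# Toy energies (hypothetical values), rows = frequency channels, columns = frames.
speech = np.array([[4.0, 1.0],
                   [0.0, 9.0]])
noise = np.array([[1.0, 1.0],
                  [2.0, 1.0]])

r = ratio_mask(speech, noise)   # soft weights in [0, 1]
b = binary_mask(speech, noise)  # hard 0/1 decisions

# Applying a mask is an element-wise product with the mixture energy.
mixture = speech + noise
speech_estimate_ratio = r * mixture
speech_estimate_binary = b * mixture
```

In this sketch the ratio mask recovers the speech energy in each unit (up to `eps`), while the binary mask passes the full mixture energy in speech-dominated units and discards the rest. The second kind of output is what a missing data recognizer treats as reliable versus missing.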
Bibliographic reference. Srinivasan, Soundararajan / Roman, Nicoleta / Wang, DeLiang (2004): "On binary and ratio time-frequency masks for robust speech recognition", In INTERSPEECH-2004, 2541-2544.