8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

On Binary and Ratio Time-Frequency Masks for Robust Speech Recognition

Soundararajan Srinivasan, Nicoleta Roman, DeLiang Wang

The Ohio State University, USA

A time-varying Wiener filter extracts the speech signal from a noisy mixture using the a priori signal-to-noise ratio in each local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with that of a missing data recognizer, which operates in the spectral domain and uses only the time-frequency units dominated by speech. For the missing data recognizer, the same processor is used to estimate an ideal time-frequency binary mask, which selects a local time-frequency unit if the speech in it is stronger than the interference. We find that the missing data recognizer performs better on a small-vocabulary recognition task, but the conventional recognizer performs substantially better when the vocabulary is larger.
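
The two masks compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the speech and interference energies in each time-frequency unit are already known, whereas the paper estimates them with a binaural processor. The function names and the toy energy values are hypothetical. The ratio mask uses the standard Wiener gain, SNR / (1 + SNR), which equals S / (S + N) for speech energy S and interference energy N.

```python
# Illustrative sketch only; energies are assumed known (oracle), not estimated.
# Each argument is a list of rows, one row per time frame, one value per
# frequency channel.

def ratio_mask(speech_energy, noise_energy):
    """Wiener-style gain per time-frequency unit: S / (S + N)."""
    mask = []
    for s_row, n_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(s_row, n_row):
            total = s + n
            # An empty unit carries no speech, so its gain is set to 0.
            row.append(s / total if total > 0 else 0.0)
        mask.append(row)
    return mask


def binary_mask(speech_energy, noise_energy):
    """1 where speech is strictly stronger than the interference, else 0."""
    return [
        [1 if s > n else 0 for s, n in zip(s_row, n_row)]
        for s_row, n_row in zip(speech_energy, noise_energy)
    ]


def apply_mask(mask, mixture_energy):
    """Weight each time-frequency unit of the mixture by its mask value."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture_energy)
    ]


# Toy example: 2 time frames, 2 frequency channels (made-up values).
speech = [[4.0, 1.0], [0.0, 9.0]]
noise = [[1.0, 1.0], [2.0, 1.0]]

rmask = ratio_mask(speech, noise)
bmask = binary_mask(speech, noise)
```

In this sketch the ratio mask attenuates every unit in proportion to its local signal-to-noise ratio, while the binary mask keeps or discards whole units. A unit where speech and interference are equally strong gets a ratio-mask value of 0.5 but a binary-mask value of 0.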

Bibliographic reference. Srinivasan, Soundararajan / Roman, Nicoleta / Wang, DeLiang (2004): "On binary and ratio time-frequency masks for robust speech recognition", in INTERSPEECH-2004, 2541-2544.