INTERSPEECH 2015
16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Time-Frequency Masking for Large Scale Robust Speech Recognition

Yuxuan Wang (1), Ananya Misra (2), Kean K. Chin (2)

(1) Ohio State University, USA
(2) Google, USA

Time-frequency mask estimation has recently shown considerable success. In this paper, we demonstrate its utility as a feature enhancement frontend for large vocabulary conversational speech recognition. We also investigate how masking compares with feature denoising, which directly reconstructs clean features from noisy ones. We train a mask estimator that predicts ideal ratio masks. Experimental results on Google voice search evaluation sets demonstrate that masking is superior to feature denoising, and that a lightweight masking frontend produces significant improvements over a strong baseline. We also show that masking improves the performance of a multi-condition trained (MTR) acoustic model.
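
The abstract does not define the ideal ratio mask, so the following is only a minimal sketch of the commonly used formulation, in which each time-frequency unit is weighted by the ratio of speech energy to total energy, raised to an exponent. The function names, the `beta` exponent (0.5 is a common choice in the literature), the `eps` guard, and the random toy data are all assumptions made for illustration; they are not taken from the paper, whose feature domain and training setup are not described here. In a real frontend the mask would be predicted from noisy features by a trained estimator rather than computed from the clean and noise signals.

```python
import numpy as np

def ideal_ratio_mask(speech_power, noise_power, beta=0.5, eps=1e-12):
    """Assumed IRM form: (S / (S + N)) ** beta per time-frequency unit.

    speech_power, noise_power: non-negative arrays of shape (frames, bins).
    beta and eps are illustrative parameters, not values from the paper.
    """
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # eps avoids division by zero in units where both powers are zero
    return (speech_power / (speech_power + noise_power + eps)) ** beta

def apply_mask(noisy_features, mask):
    """Masking frontend: scale each noisy time-frequency unit by its mask value."""
    return np.asarray(noisy_features, dtype=float) * mask

# Toy data: 5 frames x 8 frequency bins of made-up power values
rng = np.random.default_rng(0)
speech = rng.uniform(0.0, 1.0, size=(5, 8))
noise = rng.uniform(0.0, 1.0, size=(5, 8))
noisy = speech + noise  # assumes speech and noise powers add

mask = ideal_ratio_mask(speech, noise)   # training target for a mask estimator
enhanced = apply_mask(noisy, mask)       # enhanced features passed to the recognizer
```

Under the additive-power assumption and with `beta=1.0`, applying the mask to the noisy power recovers the clean speech power, which is the sense in which the mask is "ideal". Feature denoising, by contrast, would regress the clean features directly instead of a bounded mask in [0, 1].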

Bibliographic reference. Wang, Yuxuan / Misra, Ananya / Chin, Kean K. (2015): "Time-frequency masking for large scale robust speech recognition", In INTERSPEECH-2015, 2469-2473.