Time-frequency mask estimation has recently shown considerable success. In this paper, we demonstrate its utility as a feature enhancement frontend for large vocabulary conversational speech recognition. We also investigate how masking compares with feature denoising, which directly reconstructs clean features from noisy ones. We train a mask estimator to predict ideal ratio masks. Experimental results on Google voice search evaluation sets show that masking outperforms feature denoising, and that a lightweight masking frontend yields significant improvements over a strong baseline. We also show that masking improves the performance of a multi-condition trained (MTR) acoustic model.
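For readers unfamiliar with the training target, the following is a minimal sketch of one commonly used definition of the ideal ratio mask: the per-bin ratio of speech energy to the sum of speech and noise energy, optionally raised to an exponent. The abstract does not state which variant the paper uses, so the function names, the `beta` exponent, the `eps` guard, and the assumption that speech and noise energies add in the power domain are all illustrative assumptions, not the authors' implementation.

```python
def ideal_ratio_mask(speech_energy, noise_energy, beta=1.0, eps=1e-12):
    """Compute a ratio mask per time-frequency bin (hypothetical helper).

    speech_energy, noise_energy: lists of frames, each a list of
    non-negative per-bin energies of the same shape.
    beta: assumed compression exponent (0.5 and 1.0 are both common).
    eps: small constant that avoids division by zero in silent bins.
    """
    return [
        [(s / (s + n + eps)) ** beta for s, n in zip(s_frame, n_frame)]
        for s_frame, n_frame in zip(speech_energy, noise_energy)
    ]


def apply_mask(mask, noisy_energy):
    """Scale each noisy bin by its mask value to get enhanced features."""
    return [
        [m * y for m, y in zip(m_frame, y_frame)]
        for m_frame, y_frame in zip(mask, noisy_energy)
    ]


# Toy example: 2 frames x 2 frequency bins (made-up numbers).
speech = [[4.0, 0.0], [1.0, 9.0]]
noise = [[4.0, 2.0], [3.0, 1.0]]
# Assumption: energies are additive, i.e. speech and noise are uncorrelated.
noisy = [[s + n for s, n in zip(sf, nf)] for sf, nf in zip(speech, noise)]

mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask(mask, noisy)
```

In a masking frontend, a model is trained to predict such a mask from noisy features alone, since the clean speech and noise energies are only available when building training targets. Feature denoising, by contrast, would predict the clean features directly rather than a bounded multiplicative mask.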
Bibliographic reference. Wang, Yuxuan / Misra, Ananya / Chin, Kean K. (2015): "Time-frequency masking for large scale robust speech recognition", In INTERSPEECH-2015, 2469-2473.