Time-frequency mask estimation has recently shown considerable success. In this paper, we demonstrate its utility as a feature enhancement frontend for large vocabulary conversational speech recognition. Additionally, we investigate how masking compares with feature denoising, which directly reconstructs clean features from noisy ones. We train a mask estimator that predicts ideal ratio masks. Experimental results on Google voice search evaluation sets demonstrate that masking is superior to feature denoising, and that a lightweight masking frontend produces significant improvements over a strong baseline. We also show that masking improves the performance of a multi-condition trained (MTR) acoustic model.
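As an illustration of the training target named in the abstract, the following is a minimal sketch of how an ideal ratio mask might be computed for a single frame when the clean and noise energies are both known. The abstract does not give the exact formula. The form used here (the clean-to-total energy ratio raised to an exponent `beta`, with `beta = 0.5`) is one common definition and is an assumption, as are the function names `ideal_ratio_mask` and `apply_mask` and the toy energy values.

```python
def ideal_ratio_mask(clean_power, noise_power, beta=0.5, eps=1e-12):
    """Return a per-bin mask in [0, 1] for one frame of time-frequency energies.

    clean_power, noise_power: equal-length sequences of non-negative energies.
    beta: compression exponent (0.5 is an assumed, commonly used value).
    eps: small constant that avoids division by zero in empty bins.
    """
    if len(clean_power) != len(noise_power):
        raise ValueError("clean and noise frames must have the same length")
    return [
        (s / (s + n + eps)) ** beta
        for s, n in zip(clean_power, noise_power)
    ]


def apply_mask(noisy_power, mask):
    """Masking: scale each noisy bin by its mask value."""
    return [m * y for m, y in zip(mask, noisy_power)]


# Toy frame with three frequency bins (hypothetical values).
clean = [4.0, 1.0, 0.0]
noise = [0.0, 1.0, 2.0]
# Assumes clean and noise energies add, i.e. the two signals are uncorrelated.
noisy = [s + n for s, n in zip(clean, noise)]

mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(noisy, mask)
```

In this sketch a speech-dominated bin gets a mask value near 1, a noise-only bin gets 0, and a bin with equal speech and noise energy gets a value in between. A mask estimator would be trained to predict such values from noisy features alone, whereas feature denoising would regress the clean features directly.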
Cite as: Wang, Y., Misra, A., Chin, K.K. (2015) Time-frequency masking for large scale robust speech recognition. Proc. Interspeech 2015, 2469-2473, doi: 10.21437/Interspeech.2015-533
@inproceedings{wang15h_interspeech,
  author={Yuxuan Wang and Ananya Misra and Kean K. Chin},
  title={{Time-frequency masking for large scale robust speech recognition}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={2469--2473},
  doi={10.21437/Interspeech.2015-533}
}