This work proposes and compares perceptually motivated loss functions for deep-learning-based binary mask estimation in speech separation. Previous loss functions have focused on maximising the classification accuracy of mask estimation; we instead propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate, which is known to correlate more closely with speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function in binary mask estimation that maximises classification accuracy. We first propose a loss function that maximises the HIT-FA rate instead of classification accuracy. We then propose a second loss function that is a hybrid of CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Evaluations of the perceptually motivated loss functions on the GRID database show improvements in HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore the application of the perceptually motivated loss functions to a larger-vocabulary dataset.
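To make the three objectives concrete, the following is a minimal sketch of how CE, a HIT-FA-style loss, and a hybrid of the two might be written for a single frame of time-frequency units. The abstract does not give the paper's formulation, so everything here is an assumption for illustration: the function names, the "soft" HIT-FA surrogate (which replaces hard counts with the estimated probabilities so that it varies smoothly with the estimate), and the weighting parameter `alpha` are all hypothetical.

```python
import math


def hit_fa_rate(target, estimate, threshold=0.5):
    """HIT-FA rate of a thresholded mask estimate against a binary target mask.

    HIT is the fraction of target-1 units estimated as 1;
    FA is the fraction of target-0 units estimated as 1.
    """
    ones = sum(1 for t in target if t == 1)
    zeros = len(target) - ones
    hits = sum(1 for t, e in zip(target, estimate) if t == 1 and e >= threshold)
    false_alarms = sum(1 for t, e in zip(target, estimate) if t == 0 and e >= threshold)
    hit = hits / ones if ones else 0.0
    fa = false_alarms / zeros if zeros else 0.0
    return hit - fa


def cross_entropy_loss(target, estimate, eps=1e-7):
    """Mean binary cross-entropy between target mask and estimated probabilities."""
    total = 0.0
    for t, e in zip(target, estimate):
        e = min(max(e, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(e) + (1 - t) * math.log(1.0 - e)
    return total / len(target)


def soft_hit_fa_loss(target, estimate):
    """Assumed surrogate: negative HIT-FA computed from probabilities, not hard decisions."""
    ones = sum(target)
    zeros = len(target) - ones
    soft_hit = sum(t * e for t, e in zip(target, estimate)) / ones if ones else 0.0
    soft_fa = sum((1 - t) * e for t, e in zip(target, estimate)) / zeros if zeros else 0.0
    return -(soft_hit - soft_fa)  # minimising this maximises the soft HIT-FA


def hybrid_loss(target, estimate, alpha=0.5):
    """Assumed hybrid: weighted sum of CE and the soft HIT-FA loss (alpha is hypothetical)."""
    return (alpha * cross_entropy_loss(target, estimate)
            + (1.0 - alpha) * soft_hit_fa_loss(target, estimate))
```

For example, with a target mask `[1, 1, 0, 0, 1, 0]` and estimated probabilities `[0.9, 0.4, 0.2, 0.7, 0.8, 0.1]`, two of the three target-1 units are hit and one of the three target-0 units is a false alarm, so `hit_fa_rate` gives 2/3 - 1/3 = 1/3. Setting `alpha` to 1 in this sketch recovers plain CE, and setting it to 0 recovers the soft HIT-FA loss alone.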
Cite as: Websdale, D., Milner, B. (2017) A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation. Proc. Interspeech 2017, 2003-2007, doi: 10.21437/Interspeech.2017-1504
@inproceedings{websdale17_interspeech,
  author={Danny Websdale and Ben Milner},
  title={{A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2003--2007},
  doi={10.21437/Interspeech.2017-1504}
}