Investigation of Cost Function for Supervised Monaural Speech Separation

Yun Liu, Hui Zhang, Xueliang Zhang, Yuhang Cao

Speech separation aims to improve the quality of noisy speech. Deep-learning-based speech separation methods usually use the mean squared error (MSE) as the cost function, which measures the distance between the model output and the training target. However, the MSE does not match the evaluation metrics perfectly: optimizing the MSE does not directly improve the commonly used metrics, such as short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), signal-to-noise ratio (SNR), and source-to-distortion ratio (SDR). In this study, we examine other cost function candidates based on divergences, e.g., the Kullback-Leibler and Itakura-Saito divergences. We propose and examine a conjecture about the correlation between cost functions and evaluation metrics to explain why these cost functions behave differently. On the basis of this conjecture, the optimal cost function candidate is selected. The experimental results validate our conjecture.
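To make the cost function candidates named in the abstract concrete, the following is a minimal sketch of the MSE alongside the generalized Kullback-Leibler and Itakura-Saito divergences in their standard textbook forms. It assumes the inputs are non-negative spectral values (for example magnitude or power spectra of the clean target and the model estimate); the function names, the `EPS` constant, and the sample values are illustrative assumptions, and the exact formulations used in the paper may differ.

```python
import math

EPS = 1e-8  # assumed small constant to avoid log(0) and division by zero


def mse(target, estimate):
    """Mean squared error between two equal-length sequences."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def kl_divergence(target, estimate):
    """Generalized Kullback-Leibler divergence for non-negative values:
    sum of t * log(t / e) - t + e."""
    total = 0.0
    for t, e in zip(target, estimate):
        t, e = t + EPS, e + EPS
        total += t * math.log(t / e) - t + e
    return total


def is_divergence(target, estimate):
    """Itakura-Saito divergence for non-negative values:
    sum of t / e - log(t / e) - 1."""
    total = 0.0
    for t, e in zip(target, estimate):
        ratio = (t + EPS) / (e + EPS)
        total += ratio - math.log(ratio) - 1.0
    return total


# Hypothetical spectral values for one frame (illustrative only).
target = [1.0, 2.0, 4.0]
estimate = [1.5, 2.0, 3.0]

print("MSE:", mse(target, estimate))
print("KL: ", kl_divergence(target, estimate))
print("IS: ", is_divergence(target, estimate))
```

One property worth noting in this sketch: the Itakura-Saito divergence depends only on the ratio between target and estimate, so scaling both by the same factor leaves it essentially unchanged, whereas the MSE grows with the square of that factor. Differences of this kind are one reason such cost functions can behave differently during training.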

DOI: 10.21437/Interspeech.2019-1897

Cite as: Liu, Y., Zhang, H., Zhang, X., Cao, Y. (2019) Investigation of Cost Function for Supervised Monaural Speech Separation. Proc. Interspeech 2019, 3178-3182, DOI: 10.21437/Interspeech.2019-1897.

  author={Yun Liu and Hui Zhang and Xueliang Zhang and Yuhang Cao},
  title={{Investigation of Cost Function for Supervised Monaural Speech Separation}},
  booktitle={Proc. Interspeech 2019},