Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation

Hakan Erdogan, Takuya Yoshioka


A successful deep learning-based method for separating a speech signal from an interfering background audio signal is based on neural network prediction of time-frequency masks that multiply the noisy signal's short-time Fourier transform (STFT) to yield the STFT of an enhanced signal. In this paper, we investigate training strategies for mask-prediction-based speech-background separation systems. First, we examine the impact of mixing speech and noise files on the fly during training, which enables models to be trained on a virtually infinite amount of data. We also investigate the effect of using a novel signal-to-noise-ratio-related loss function instead of mean-squared error, which is sensitive to scaling differences among utterances. We evaluate bidirectional long short-term memory (BLSTM) networks as well as a combination of convolutional and BLSTM (CNN+BLSTM) networks for mask prediction and compare the performance of real- and complex-valued mask prediction. Data-augmented training combined with a novel loss function yields significant improvements in signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) over the best published result on the CHiME-2 medium-vocabulary data set when using a CNN+BLSTM network.
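To make the two training ideas concrete, the sketch below shows (a) on-the-fly mixing of a speech and a noise waveform at a requested SNR, and (b) an SNR-style loss of the general kind the abstract alludes to: the negative SNR in dB, which is invariant to a common scaling of clean and estimated signals, unlike plain mean-squared error. The function names, the exact loss form, and the mixing procedure are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals `snr_db`,
    then add it to `speech`.  Hypothetical helper for on-the-fly mixing."""
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2)
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def apply_mask(mask, noisy_stft):
    """Element-wise mask applied to the noisy STFT gives the enhanced STFT."""
    return mask * noisy_stft

def neg_snr_loss(clean, estimate, eps=1e-8):
    """Negative SNR in dB between clean signal and estimate.
    Minimizing this maximizes SNR; an assumed stand-in for the paper's
    SNR-related loss (exact form not given in the abstract)."""
    err = clean - estimate
    snr_db = 10.0 * np.log10(
        (np.sum(np.abs(clean) ** 2) + eps) / (np.sum(np.abs(err) ** 2) + eps)
    )
    return -snr_db
```

In training, each minibatch would draw fresh speech/noise pairs and SNR values, mix them with `mix_at_snr`, and backpropagate `neg_snr_loss` through the mask prediction; because the loss is a ratio, utterances with different absolute levels contribute comparably.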


DOI: 10.21437/Interspeech.2018-2441

Cite as: Erdogan, H., Yoshioka, T. (2018) Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation. Proc. Interspeech 2018, 3499-3503, DOI: 10.21437/Interspeech.2018-2441.


@inproceedings{Erdogan2018,
  author={Hakan Erdogan and Takuya Yoshioka},
  title={Investigations on Data Augmentation and Loss Functions for Deep Learning Based Speech-Background Separation},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3499--3503},
  doi={10.21437/Interspeech.2018-2441},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2441}
}