Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation

Yun Liu, Hui Zhang, Xueliang Zhang


Deep learning-based speech separation has been widely studied in recent years. Most such approaches focus on recovering the magnitude spectrum of the target speech but ignore phase estimation. Recently, a method called the shifted real spectrum (SRS) was proposed. Unlike the short-time Fourier transform (STFT), the SRS contains only real components, which nonetheless encode the phase information. In this paper, we propose several SRS-based masks and use them as training targets for deep neural networks. Experimental results show that the proposed targets generally outperform the commonly used masks computed on the STFT.
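To make the contrast concrete, the sketch below computes two kinds of training targets: the ideal ratio mask (IRM) on STFT magnitudes, the common phase-ignoring baseline the abstract refers to, and the real part of a half-bin frequency-shifted DFT, which is one standard way to obtain a real-valued spectrum that still depends on phase. The function names are hypothetical, and the shifted transform is an illustrative formulation, not necessarily the paper's exact SRS definition.

```python
import numpy as np

def stft_irm(speech_spec, noise_spec, eps=1e-12):
    """Ideal ratio mask on STFT magnitudes (phase is discarded).

    Both inputs are complex STFT matrices of the same shape; the mask
    lies in [0, 1] per time-frequency bin.
    """
    s2 = np.abs(speech_spec) ** 2
    n2 = np.abs(noise_spec) ** 2
    return np.sqrt(s2 / (s2 + n2 + eps))

def half_bin_shifted_real_spectrum(frame):
    """Real part of a half-bin frequency-shifted DFT of one windowed frame.

    Unlike the STFT magnitude, this real-valued representation changes
    when the signal's phase changes, so a mask defined on it implicitly
    carries phase information. (Illustrative sketch only; see the paper
    for the exact SRS definition.)
    """
    n = np.arange(len(frame))
    # Shift every DFT bin by half a bin width before transforming.
    shifted = frame * np.exp(-1j * np.pi * n / len(frame))
    return np.real(np.fft.fft(shifted))
```

A mask built on the shifted real spectrum can then be used as a regression target in exactly the same training pipeline as the IRM, which is the comparison the paper's experiments carry out.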


DOI: 10.21437/Interspeech.2018-1650

Cite as: Liu, Y., Zhang, H., Zhang, X. (2018) Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation. Proc. Interspeech 2018, 1151-1155, DOI: 10.21437/Interspeech.2018-1650.


@inproceedings{Liu2018,
  author={Yun Liu and Hui Zhang and Xueliang Zhang},
  title={Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1151--1155},
  doi={10.21437/Interspeech.2018-1650},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1650}
}