Masking Estimation with Phase Restoration of Clean Speech for Monaural Speech Enhancement

Xianyun Wang, Changchun Bao


Deep neural networks (DNNs) have become a popular means of separating target speech from noisy speech because they are effective at learning a mapping between noisy speech and a training target. In DNN-based methods, the time-frequency (T-F) mask commonly used as the training target has a significant impact on the quality of the restored speech. However, a T-F mask generally modifies only the magnitude spectrum of the noisy speech and leaves the phase spectrum unchanged during enhancement. Recent studies have shown that incorporating phase information into the T-F mask can effectively improve the perceptual quality of the enhanced speech. In this paper, we therefore present two T-F masks that simultaneously enhance the magnitude and phase of the speech spectrum, based on the assumption that the real and imaginary parts of the speech spectrum are uncorrelated, and we use them as the training targets of the DNN model. Experimental results show that, compared with the reference methods, the proposed method improves speech quality across different signal-to-noise ratio (SNR) conditions.
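
The contrast drawn in the abstract can be illustrated with a small NumPy sketch. It is not the masks proposed in the paper, whose definitions are not given here. It only assumes a common ratio-style magnitude mask and, as a hypothetical stand-in for a phase-aware target, oracle masks applied separately to the real and imaginary parts of the noisy spectrum. The function names, the random "spectra", and the oracle mask formulas are all illustrative assumptions.

```python
import numpy as np


def magnitude_mask_enhance(noisy_spec, mask):
    """Scale the noisy magnitude by a real-valued mask and reuse the noisy phase."""
    return mask * np.abs(noisy_spec) * np.exp(1j * np.angle(noisy_spec))


def real_imag_mask_enhance(noisy_spec, mask_r, mask_i):
    """Scale real and imaginary parts separately (hypothetical illustration,
    not the paper's mask definition)."""
    return mask_r * noisy_spec.real + 1j * mask_i * noisy_spec.imag


# Toy complex "spectra" standing in for STFT coefficients (frequency x time).
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# Assumed ratio-style magnitude mask computed from oracle clean and noise energies.
ratio_mask = np.sqrt(np.abs(clean) ** 2 / (np.abs(clean) ** 2 + np.abs(noise) ** 2))
enhanced_mag = magnitude_mask_enhance(noisy, ratio_mask)

# The magnitude-only output still carries the noisy phase.
phase_unchanged = np.allclose(
    enhanced_mag / np.abs(enhanced_mag), noisy / np.abs(noisy)
)

# Oracle masks on the real and imaginary parts (assumed form) also correct the phase.
mask_r = clean.real / noisy.real
mask_i = clean.imag / noisy.imag
enhanced_ri = real_imag_mask_enhance(noisy, mask_r, mask_i)
recovers_clean = np.allclose(enhanced_ri, clean)

err_mag = np.mean(np.abs(enhanced_mag - clean) ** 2)
err_ri = np.mean(np.abs(enhanced_ri - clean) ** 2)
```

In this toy setting the magnitude-only result keeps the noisy phase, so some error remains even with an oracle mask, whereas masking the real and imaginary parts separately can in principle restore both magnitude and phase. In practice a DNN would estimate such masks from noisy features, and the estimates would be bounded or compressed rather than taken from oracle ratios.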


DOI: 10.21437/Interspeech.2019-1141

Cite as: Wang, X., Bao, C. (2019) Masking Estimation with Phase Restoration of Clean Speech for Monaural Speech Enhancement. Proc. Interspeech 2019, 3188-3192, DOI: 10.21437/Interspeech.2019-1141.


@inproceedings{Wang2019,
  author={Xianyun Wang and Changchun Bao},
  title={{Masking Estimation with Phase Restoration of Clean Speech for Monaural Speech Enhancement}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3188--3192},
  doi={10.21437/Interspeech.2019-1141},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1141}
}