UNetGAN: A Robust Speech Enhancement Approach in Time Domain for Extremely Low Signal-to-Noise Ratio Condition

Xiang Hao, Xiangdong Su, Zhiyu Wang, Hui Zhang, Batushiren


Speech enhancement under extremely low signal-to-noise ratio (SNR) conditions is a very challenging problem that has rarely been investigated in previous work. This paper proposes a robust speech enhancement approach (UNetGAN) based on U-Net and generative adversarial learning to deal with this problem. The approach consists of a generator network and a discriminator network, both of which operate directly in the time domain. The generator network adopts a U-Net-like structure and employs dilated convolution in its bottleneck. We evaluate the performance of UNetGAN under low SNR conditions (down to -20 dB) on a public benchmark. The results demonstrate that it significantly improves speech quality and substantially outperforms representative deep learning models, including SEGAN, cGAN for SE, bidirectional LSTM using a phase-sensitive spectrum approximation cost function (PSA-BLSTM), and Wave-U-Net, in terms of Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ).
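
For orientation, an SNR of -20 dB means the noise carries 100 times the power of the speech. The following is a minimal sketch of how a noisy mixture at a target SNR could be constructed from the standard definition SNR = 10 log10(P_speech / P_noise). It is not taken from the paper's data pipeline; the function names `power`, `mix_at_snr`, and `measured_snr_db` are hypothetical and chosen only for illustration.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`.

    Returns (mixture, scaled_noise). Hypothetical helper for illustration;
    assumes both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = power(speech)
    p_noise = power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # Target noise power is p_speech / 10^(snr_db / 10); the amplitude gain
    # is the square root of the ratio between target and current noise power.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB computed from the two separate components."""
    return 10 * math.log10(power(speech) / power(noise))
```

Calling `mix_at_snr(speech, noise, -20.0)` returns a mixture whose noise component has 100 times the power of the speech component, which is the most severe condition named in the abstract.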


 DOI: 10.21437/Interspeech.2019-1567

Cite as: Hao, X., Su, X., Wang, Z., Zhang, H., Batushiren (2019) UNetGAN: A Robust Speech Enhancement Approach in Time Domain for Extremely Low Signal-to-Noise Ratio Condition. Proc. Interspeech 2019, 1786-1790, DOI: 10.21437/Interspeech.2019-1567.


@inproceedings{Hao2019,
  author={Xiang Hao and Xiangdong Su and Zhiyu Wang and Hui Zhang and Batushiren},
  title={{UNetGAN: A Robust Speech Enhancement Approach in Time Domain for Extremely Low Signal-to-Noise Ratio Condition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1786--1790},
  doi={10.21437/Interspeech.2019-1567},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1567}
}