ISCA Archive Interspeech 2021

Learning Speech Structure to Improve Time-Frequency Masks

Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han

Time-frequency (TF) masks are widely used in speech enhancement (SE). However, accurately estimating TF masks from noisy speech remains a challenge for both statistical and neural network (NN) approaches: statistical model-based mask estimation usually depends on a good parameter initialization, while NN-based mask estimation relies on setting proper and stable learning targets. To address these issues, we propose a novel approach that extracts TF speech structures from clean speech data and partitions a noisy speech spectrogram into mutually exclusive regions of core speech, core noise, and transition. Using such region targets derived from clean speech, we train a bidirectional LSTM to predict regions from noisy speech, which is easier than predicting masks. The predicted regions can further be used in place of masks in beamforming, or integrated with statistical and NN-based mask estimation to constrain mask values and model parameter updates. Our experimental results on ASR (CHiME-3) and SE (CHiME-3 and LibriSpeech) demonstrate the effectiveness of learning speech region structure to improve TF masks.
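One way the three-region targets described in the abstract might be derived, for illustration, is by thresholding an ideal ratio mask (IRM) computed from parallel clean and noise spectrograms. This is a hedged sketch, not the paper's exact construction: the IRM definition, the threshold values (`lo`, `hi`), and the function name are assumptions introduced here for illustration only.

```python
import numpy as np

def region_labels(clean_mag, noise_mag, lo=0.3, hi=0.7):
    """Partition TF bins into core noise (0), transition (1), core speech (2).

    clean_mag, noise_mag: magnitude spectrograms of shape (frames, freq_bins).
    lo, hi: illustrative thresholds on the ideal ratio mask; the paper's
    actual region construction may differ.
    """
    # Ideal ratio mask from parallel clean/noise magnitudes (assumed form).
    irm = clean_mag / (clean_mag + noise_mag + 1e-8)
    labels = np.full(irm.shape, 1, dtype=np.int64)  # default: transition
    labels[irm >= hi] = 2  # core speech: mask confidently near 1
    labels[irm <= lo] = 0  # core noise: mask confidently near 0
    return labels

# Usage sketch with random stand-in spectrograms (frames x frequency bins).
rng = np.random.default_rng(0)
clean = rng.random((100, 257))
noise = rng.random((100, 257))
targets = region_labels(clean, noise)  # training targets for a BLSTM classifier
```

Because each bin gets exactly one of three labels, the regions are mutually exclusive by construction; a BLSTM trained on these categorical targets then performs region prediction rather than continuous mask regression.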

doi: 10.21437/Interspeech.2021-1859

Cite as: Bu, S., Zhao, Y., Wang, S., Han, M. (2021) Learning Speech Structure to Improve Time-Frequency Masks. Proc. Interspeech 2021, 2731-2735, doi: 10.21437/Interspeech.2021-1859

@inproceedings{bu21_interspeech,
  author={Suliang Bu and Yunxin Zhao and Shaojun Wang and Mei Han},
  title={{Learning Speech Structure to Improve Time-Frequency Masks}},
  booktitle={Proc. Interspeech 2021},
  pages={2731--2735},
  doi={10.21437/Interspeech.2021-1859},
  year={2021}
}