All-Neural Multi-Channel Speech Enhancement

Zhong-Qiu Wang, DeLiang Wang


This study proposes a novel all-neural approach for multi-channel speech enhancement, where robust speaker localization, acoustic beamforming, post-filtering, and spatial filtering are all performed using deep-learning-based time-frequency (T-F) masking. Our system first performs monaural speech enhancement on each microphone signal to obtain estimated ideal ratio masks for beamforming and robust time difference of arrival (TDOA) estimation. With the estimated TDOA, directional features are then computed that indicate whether each T-F unit is dominated by the signal arriving from the estimated target direction. Next, the directional features are combined with spectral features extracted from the beamformed signal to achieve further enhancement. Experiments on a two-microphone setup in reverberant environments with strong diffuse babble noise demonstrate the effectiveness of the proposed approach for multi-channel speech enhancement.
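To make the directional-feature step concrete, the following is a minimal NumPy sketch (not the authors' implementation): given the STFTs of a two-microphone pair and an estimated TDOA, a common formulation scores each T-F unit by the cosine distance between the observed inter-channel phase difference (IPD) and the phase difference the TDOA predicts. The function name and arguments are illustrative assumptions.

```python
import numpy as np

def directional_feature(X1, X2, tdoa, sr, n_fft):
    """Directional feature per T-F unit (illustrative sketch).

    Compares the observed inter-channel phase difference (IPD) with the
    phase difference predicted by the estimated TDOA; values near 1
    indicate T-F units dominated by the source at the estimated direction.

    X1, X2 : complex STFTs of the two microphone signals, shape (F, T)
    tdoa   : estimated time difference of arrival in seconds
    sr     : sampling rate in Hz
    n_fft  : FFT size used to compute the STFTs
    """
    freqs = np.arange(X1.shape[0]) * sr / n_fft      # bin center frequencies (Hz)
    predicted_ipd = 2 * np.pi * freqs * tdoa         # phase shift implied by the TDOA
    observed_ipd = np.angle(X1) - np.angle(X2)       # observed IPD at each T-F unit
    # Cosine is 2*pi-periodic, so phase wrapping needs no special handling.
    return np.cos(observed_ipd - predicted_ipd[:, None])
```

For a T-F unit containing only a source exactly at the estimated direction, the observed and predicted IPDs coincide and the feature equals 1; diffuse noise yields values scattered below 1, which is what makes the feature discriminative as an extra network input.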


DOI: 10.21437/Interspeech.2018-1664

Cite as: Wang, Z., Wang, D. (2018) All-Neural Multi-Channel Speech Enhancement. Proc. Interspeech 2018, 3234-3238, DOI: 10.21437/Interspeech.2018-1664.


@inproceedings{Wang2018,
  author={Zhong-Qiu Wang and DeLiang Wang},
  title={All-Neural Multi-Channel Speech Enhancement},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3234--3238},
  doi={10.21437/Interspeech.2018-1664},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1664}
}