ISCA Archive Interspeech 2017

Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings

Lukas Drude, Reinhold Haeb-Umbach

Recent advances in discriminatively trained mask estimation networks to extract a single source utilizing beamforming techniques demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation maximization (EM) algorithm which jointly estimates a model for the deep clustering embeddings and for the complex-valued spatial observations in the short time Fourier transform (STFT) domain at test time. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone, in terms of signal-to-distortion ratio (SDR) gains and gains in the perceptually motivated PESQ metric. ASR results on a reverberated dataset further show that the aforementioned gains translate to reduced word error rates (WERs) even in reverberant environments.
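To make the idea of the joint EM concrete, the following is a minimal sketch (not the authors' implementation): one EM loop that couples a von-Mises-Fisher-like model on unit-norm deep clustering embeddings with a complex-Watson-like model on normalized multi-channel STFT snapshots, sharing a single set of class posteriors ("masks"). Concentration parameters are fixed, normalization constants are dropped, and all names (joint_em, embeddings, Y, num_speakers, ...) are illustrative assumptions rather than the paper's API.

import numpy as np


def joint_em(embeddings, Y, num_speakers=2, iterations=20,
             kappa_e=20.0, kappa_s=5.0):
    """embeddings: (T, F, D) unit-norm real deep clustering embeddings per TF bin.
    Y:          (T, F, M) complex STFT snapshots, normalized per TF bin.
    Returns per-bin class posteriors (masks) of shape (K, T, F)."""
    T, F, D = embeddings.shape
    K = num_speakers

    rng = np.random.default_rng(0)
    # Random initialization of the class posteriors.
    gamma = rng.dirichlet(np.ones(K), size=(T, F)).transpose(2, 0, 1)  # (K, T, F)

    for _ in range(iterations):
        # ----- M-step -----
        # Mixture weights.
        pi = gamma.mean(axis=(1, 2))  # (K,)
        # Embedding model: weighted mean direction per class (shared over frequency).
        mu = np.einsum('ktf,tfd->kd', gamma, embeddings)
        mu /= np.linalg.norm(mu, axis=-1, keepdims=True) + 1e-10
        # Spatial model: per-frequency dominant eigenvector of the weighted PSD matrix.
        psd = np.einsum('ktf,tfm,tfn->kfmn', gamma, Y, Y.conj())  # (K, F, M, M)
        psd /= gamma.sum(axis=1)[:, :, None, None] + 1e-10
        _, vecs = np.linalg.eigh(psd)
        w = vecs[..., -1]  # (K, F, M)

        # ----- E-step -----
        # Unnormalized log-likelihoods of the spectral and spatial models.
        log_spec = kappa_e * np.einsum('kd,tfd->ktf', mu, embeddings)
        log_spat = kappa_s * np.abs(np.einsum('kfm,tfm->ktf', w.conj(), Y)) ** 2
        log_post = np.log(pi)[:, None, None] + log_spec + log_spat
        log_post -= log_post.max(axis=0, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=0, keepdims=True)

    return gamma

The resulting masks could then drive a mask-based beamformer (e.g. MVDR or GEV) for the extraction step mentioned above; the actual model and update equations of the paper may differ from this simplified sketch.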


doi: 10.21437/Interspeech.2017-187

Cite as: Drude, L., Haeb-Umbach, R. (2017) Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings. Proc. Interspeech 2017, 2650-2654, doi: 10.21437/Interspeech.2017-187

@inproceedings{drude17_interspeech,
  author={Lukas Drude and Reinhold Haeb-Umbach},
  title={{Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2650--2654},
  doi={10.21437/Interspeech.2017-187}
}