Recent advances in discriminatively trained mask estimation networks, which extract a single source by means of beamforming, demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings of spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation-maximization (EM) algorithm which, at test time, jointly estimates a model of the deep clustering embeddings and of the complex-valued spatial observations in the short-time Fourier transform (STFT) domain. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone, in terms of signal-to-distortion ratio (SDR) and perceptual evaluation of speech quality (PESQ) gains. ASR results on a reverberated dataset further show that these gains translate to reduced word error rates (WERs) even in reverberant environments.
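The abstract describes one EM algorithm that couples a spectral model of the deep clustering embeddings with a spatial model of the STFT observations through shared class posteriors. The following minimal Python sketch illustrates that coupling under assumptions not taken from the paper: the embeddings are modeled with a von Mises-Fisher mixture with a concentration shared across classes, the per-bin spatial log-likelihoods are assumed to come from some external spatial mixture model, and the function names (joint_e_step, vmf_log_likelihood, m_step) are hypothetical. It is a rough illustration of the shared E-step idea, not the authors' exact formulation.

import numpy as np


def joint_e_step(log_p_emb, log_p_spatial, log_pi):
    """Shared E-step: per-bin class posteriors from both feature streams.

    log_p_emb     : (T, F, K) log-likelihood of each embedding under class k
    log_p_spatial : (T, F, K) log-likelihood of each spatial vector under class k
                    (assumed to be supplied by a separate spatial mixture model)
    log_pi        : (K,)      log mixture weights
    """
    log_post = log_pi + log_p_emb + log_p_spatial
    log_post -= log_post.max(axis=-1, keepdims=True)   # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=-1, keepdims=True)
    return gamma


def vmf_log_likelihood(embeddings, means, kappa=20.0):
    """Spectral stream: von Mises-Fisher log-likelihood (assumed shared concentration).

    embeddings : (T, F, D) unit-norm deep clustering embeddings
    means      : (K, D)    unit-norm class mean directions
    With a concentration shared across classes, the normalizer is a common
    additive constant and can be dropped for posterior computation.
    """
    return kappa * np.einsum('tfd,kd->tfk', embeddings, means)


def m_step(embeddings, gamma):
    """Update mixture weights and vMF mean directions from the posteriors."""
    pi = gamma.reshape(-1, gamma.shape[-1]).mean(axis=0)            # (K,)
    weighted = np.einsum('tfk,tfd->kd', gamma, embeddings)          # (K, D)
    means = weighted / np.linalg.norm(weighted, axis=-1, keepdims=True)
    return np.log(pi), means


# One EM iteration at test time might then look like:
# gamma = joint_e_step(vmf_log_likelihood(emb, means), log_p_spatial, log_pi)
# log_pi, means = m_step(emb, gamma)

The point of the sketch is that both feature streams contribute to a single set of posteriors gamma, which then drives the parameter updates of both models and, ultimately, the beamforming masks.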
Cite as: Drude, L., Haeb-Umbach, R. (2017) Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings. Proc. Interspeech 2017, 2650-2654, doi: 10.21437/Interspeech.2017-187
@inproceedings{drude17_interspeech,
  author={Lukas Drude and Reinhold Haeb-Umbach},
  title={{Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2650--2654},
  doi={10.21437/Interspeech.2017-187}
}