Multi-Target Ensemble Learning for Monaural Speech Separation

Hui Zhang, Xueliang Zhang, Guanglai Gao


Speech separation can be formulated as a supervised learning problem in which a machine is trained to map the acoustic features of noisy speech to a time-frequency mask or to the spectrum of the clean speech. These two categories of speech separation methods are generally referred to as masking-based and mapping-based methods. Neither can estimate the clean speech perfectly, since each target captures only part of the characteristics of the speech. However, the estimated masks and spectra can be complementary, as they describe the speech from different perspectives. In this paper, adopting an ensemble framework, we propose a multi-target deep neural network (DNN) based method that combines the masking-based and mapping-based strategies: the DNN is trained to jointly estimate the time-frequency masks and the clean spectrum. We show that, as expected, the mask-based and spectrum-based targets yield partly complementary estimates, and that separation performance improves when these estimates are merged. Furthermore, we develop a merging model trained jointly with the multi-target DNN. Experimental results indicate that the proposed multi-target DNN based method outperforms DNN-based algorithms that optimize a single target.
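The merging idea can be illustrated with a minimal numpy sketch. All arrays and values below are hypothetical, and the simple averaging used here is a stand-in for the learned merging model described in the paper: a masking-based estimate (mask applied to the noisy spectrum) and a mapping-based estimate (a directly predicted clean spectrum) are combined into one output.

```python
import numpy as np

# Toy magnitude spectra (frames x frequency bins); values are illustrative.
noisy = np.array([[1.0, 2.0, 3.0],
                  [4.0, 1.0, 2.0]])

# Hypothetical network outputs for the two targets:
# a ratio-style mask in [0, 1] and a directly mapped clean spectrum.
est_mask = np.array([[0.8, 0.5, 0.2],
                     [0.6, 0.9, 0.4]])
est_spectrum = np.array([[0.7, 1.2, 0.5],
                         [2.6, 0.8, 0.9]])

# Masking-based estimate: apply the estimated mask to the noisy spectrum.
masked = est_mask * noisy

# Simplest possible merge: average the two complementary estimates.
# (The paper trains a merging model jointly with the DNN instead.)
merged = 0.5 * (masked + est_spectrum)
```

Because the two estimates err in different ways, even this naive average can land closer to the clean spectrum than either estimate alone, which is the intuition behind training a dedicated merging model.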


 DOI: 10.21437/Interspeech.2017-240

Cite as: Zhang, H., Zhang, X., Gao, G. (2017) Multi-Target Ensemble Learning for Monaural Speech Separation. Proc. Interspeech 2017, 1958-1962, DOI: 10.21437/Interspeech.2017-240.


@inproceedings{Zhang2017,
  author={Hui Zhang and Xueliang Zhang and Guanglai Gao},
  title={Multi-Target Ensemble Learning for Monaural Speech Separation},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={1958--1962},
  doi={10.21437/Interspeech.2017-240},
  url={http://dx.doi.org/10.21437/Interspeech.2017-240}
}