A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation

Yannan Wang, Jun Du, Li-Rong Dai, Chin-Hui Lee


In contrast to the conventional minimum mean squared error (MMSE) training criterion for nonlinear spectral mapping based on deep neural networks (DNNs), we propose a probabilistic learning framework to estimate the DNN parameters for single-channel speech separation. A statistical analysis of the prediction error vector at the DNN output reveals that each log power spectral component of the error follows a unimodal density. By modeling the prediction error vector as a multivariate Gaussian density with a zero mean vector and an unknown covariance matrix, we present a maximum likelihood (ML) approach to DNN parameter learning. Our experiments on the Speech Separation Challenge (SSC) corpus show that the proposed learning approach can achieve better generalization and faster convergence than MMSE-based DNN learning. Furthermore, we demonstrate that the ML-trained DNN consistently outperforms the MMSE-trained DNN on all the objective measures of speech quality and intelligibility in single-channel speech separation.
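To make the difference between the two criteria concrete, here is a minimal NumPy sketch, assuming each row of `errors` is a prediction error vector (target minus DNN-estimated log power spectrum) of dimension D. The helper names `estimate_covariance`, `gaussian_nll`, and `mse` are hypothetical and chosen for illustration; the full-covariance form and the synthetic data are assumptions, not the authors' implementation. Under a zero-mean Gaussian model with covariance Σ, the per-vector negative log-likelihood is ½(D log 2π + log|Σ| + eᵀΣ⁻¹e), and with the mean fixed at zero the ML estimate of Σ is the average outer product of the errors.

```python
import numpy as np


def estimate_covariance(errors):
    """ML estimate of the covariance of zero-mean Gaussian error vectors.

    errors: array of shape (N, D), one prediction error vector per row.
    With the mean fixed at zero, the ML estimate is (1/N) * sum_n e_n e_n^T.
    """
    n = errors.shape[0]
    return errors.T @ errors / n


def gaussian_nll(errors, cov):
    """Average negative log-likelihood of errors under N(0, cov)."""
    n, d = errors.shape
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    # Solve cov @ x = e_n for every column instead of forming cov^{-1}.
    solved = np.linalg.solve(cov, errors.T)            # shape (D, N)
    mahalanobis = np.sum(errors.T * solved, axis=0)    # e_n^T cov^{-1} e_n
    return 0.5 * np.mean(d * np.log(2.0 * np.pi) + logdet + mahalanobis)


def mse(errors):
    """MMSE-style criterion: mean over vectors of the squared error norm."""
    return np.mean(np.sum(errors ** 2, axis=1))


if __name__ == "__main__":
    # Synthetic errors with unequal per-dimension variances (illustrative only).
    rng = np.random.default_rng(0)
    errors = rng.normal(size=(1000, 4)) * np.array([1.0, 0.5, 2.0, 0.1])
    cov_hat = estimate_covariance(errors)
    print("MSE:", mse(errors))
    print("NLL with identity covariance:", gaussian_nll(errors, np.eye(4)))
    print("NLL with ML covariance:", gaussian_nll(errors, cov_hat))
```

With Σ = I the likelihood criterion reduces to half the squared error plus a constant, so MMSE training corresponds to the special case of a fixed, isotropic covariance. Letting Σ be estimated reweights the error components according to their observed variances and correlations. One plausible training scheme, assumed here because the abstract does not describe the update schedule, alternates between re-estimating Σ from the current residuals and updating the DNN weights to minimize the likelihood-based loss with Σ held fixed.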


DOI: 10.21437/Interspeech.2017-830

Cite as: Wang, Y., Du, J., Dai, L., Lee, C. (2017) A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation. Proc. Interspeech 2017, 1178-1182, DOI: 10.21437/Interspeech.2017-830.


@inproceedings{Wang2017,
  author={Yannan Wang and Jun Du and Li-Rong Dai and Chin-Hui Lee},
  title={A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1178--1182},
  doi={10.21437/Interspeech.2017-830},
  url={http://dx.doi.org/10.21437/Interspeech.2017-830}
}