In contrast to the conventional minimum mean squared error (MMSE) training criterion for nonlinear spectral mapping based on deep neural networks (DNNs), we propose a probabilistic learning framework to estimate the DNN parameters for single-channel speech separation. A statistical analysis of the prediction error vector at the DNN output reveals that it follows a unimodal density for each log power spectral component. By modeling the prediction error vector with a multivariate Gaussian density that has a zero mean vector and an unknown covariance matrix, we present a maximum likelihood (ML) approach to DNN parameter learning. Our experiments on the Speech Separation Challenge (SSC) corpus show that the proposed learning approach achieves better generalization and faster convergence than MMSE-based DNN learning. Furthermore, we demonstrate that the ML-trained DNN consistently outperforms the MMSE-trained DNN on all objective measures of speech quality and intelligibility in single-channel speech separation.
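As a rough illustration of the criterion described above, the following NumPy sketch evaluates the negative log-likelihood of prediction errors under a zero-mean multivariate Gaussian and compares it with a squared-error objective. It is only a sketch under stated assumptions, not the authors' implementation: the function names (`gaussian_nll`, `ml_covariance`, `mse`), the sample-covariance estimate, the small `eps` regularizer, and the synthetic error data are all hypothetical choices made for this example, and the abstract does not specify how the covariance is parameterized or updated during training.

```python
import numpy as np


def gaussian_nll(errors, cov):
    """Average negative log-likelihood of zero-mean Gaussian errors.

    errors: array of shape (N, D), one prediction error vector per frame.
    cov:    array of shape (D, D), symmetric positive definite.
    """
    _, d = errors.shape
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("cov must be positive definite")
    # Mahalanobis term e^T cov^{-1} e for every frame, without forming the inverse.
    solved = np.linalg.solve(cov, errors.T)          # shape (D, N)
    maha = np.einsum("nd,dn->n", errors, solved)     # shape (N,)
    return 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha.mean())


def ml_covariance(errors, eps=1e-6):
    """Sample covariance of zero-mean errors (assumed estimator for this sketch).

    eps adds a small diagonal term for numerical stability (an assumption).
    """
    n, d = errors.shape
    return errors.T @ errors / n + eps * np.eye(d)


def mse(errors):
    """Squared error summed over dimensions, averaged over frames."""
    return np.mean(np.sum(errors ** 2, axis=1))


# Synthetic errors whose variance differs per dimension (illustrative data only).
rng = np.random.default_rng(0)
errors = rng.normal(size=(500, 4)) * np.array([0.1, 0.5, 2.0, 3.0])

nll_identity = gaussian_nll(errors, np.eye(4))
nll_ml = gaussian_nll(errors, ml_covariance(errors))
print("NLL with identity covariance: ", nll_identity)
print("NLL with estimated covariance:", nll_ml)
```

With the covariance fixed to the identity matrix, the negative log-likelihood equals half the squared error plus a constant, so minimizing it is equivalent to the MMSE criterion. Letting the covariance be estimated from the errors is what distinguishes the likelihood-based objective in this sketch.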
Cite as: Wang, Y., Du, J., Dai, L.-R., Lee, C.-H. (2017) A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation. Proc. Interspeech 2017, 1178-1182, doi: 10.21437/Interspeech.2017-830
@inproceedings{wang17f_interspeech,
  author={Yannan Wang and Jun Du and Li-Rong Dai and Chin-Hui Lee},
  title={{A Maximum Likelihood Approach to Deep Neural Network Based Nonlinear Spectral Mapping for Single-Channel Speech Separation}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1178--1182},
  doi={10.21437/Interspeech.2017-830}
}