Recent progress in speech separation shows that supervised methods based on deep neural networks (DNNs) can improve performance in difficult noise conditions and generalize well to unseen noise scenarios. However, existing approaches do not exploit contextual information sufficiently. In this paper, we focus on exploiting contextual information with DNNs. The proposed method has two parts: a multi-resolution stacking (MRS) framework and a boosted DNN (bDNN) classifier. The MRS framework trains a stack of classifier ensembles, where each classifier in an ensemble concatenates the raw acoustic feature with the outputs of the ensemble below it as a new feature, and different classifiers in an ensemble operate on different window lengths. The bDNN classifier first generates multiple base predictions for a frame from a window that is centered on the frame and contains multiple neighboring frames, and then aggregates the base predictions into a final prediction. An experimental comparison with DNN-based speech separation in difficult noise scenarios demonstrates the effectiveness of the proposed method in terms of both prediction accuracy and objective speech intelligibility.
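The bDNN aggregation step described above can be sketched as follows. This is a hypothetical simplification for illustration only: it assumes the network emits, for each window of length 2w+1 centered on frame t, one base prediction per frame position in that window, and that the final prediction for a frame is the plain average of all base predictions targeting it across overlapping windows. The function name `bdnn_aggregate` and the averaging rule are assumptions, not the paper's exact formulation.

```python
import numpy as np

def bdnn_aggregate(base_preds, w):
    """Aggregate per-window base predictions into one prediction per frame.

    base_preds: array of shape (T, 2*w + 1); base_preds[t, j] is the base
    prediction that the window centered on frame t makes for frame t + j - w.
    Returns an array of T final predictions, each the average of every base
    prediction that targets that frame (a hypothetical aggregation rule).
    """
    T = base_preds.shape[0]
    sums = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        for j in range(2 * w + 1):
            target = t + j - w  # frame this base prediction refers to
            if 0 <= target < T:
                sums[target] += base_preds[t, j]
                counts[target] += 1
    return sums / np.maximum(counts, 1)
```

Frames near the sequence boundaries receive fewer base predictions, so the count-based normalization keeps the average well defined there.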
Bibliographic reference. Zhang, Xiao-Lei / Wang, DeLiang (2015): "Multi-resolution stacking for speech separation based on boosted DNN", In INTERSPEECH-2015, 1745-1749.