Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation

Jitong Chen, DeLiang Wang


Speech separation can be formulated as a supervised learning problem in which a learning machine estimates a time-frequency mask from acoustic features of noisy speech. Deep neural networks (DNNs) have been successful at noise generalization in supervised separation. However, real-world applications require a trained model to perform well with both unseen speakers and unseen noises. In this study we investigate speaker generalization for noise-independent models and propose a separation model based on long short-term memory (LSTM) to account for the temporal dynamics of speech. Our experiments show that the proposed model significantly outperforms a DNN in terms of objective speech intelligibility for both seen and unseen speakers. Compared to feedforward networks, the proposed model is better able to model a large number of speakers, and represents an effective approach for speaker- and noise-independent speech separation.
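
The abstract frames separation as learning a mapping from noisy-speech features to a time-frequency mask. As a rough illustration only, the following PyTorch sketch shows one way such an LSTM mask estimator could be set up; the layer sizes, feature dimension, sigmoid ratio-mask output, and ideal-ratio-mask training target are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LSTMMaskEstimator(nn.Module):
    """Illustrative LSTM-based time-frequency mask estimator.

    Maps a sequence of acoustic feature frames extracted from noisy
    speech to a ratio mask in [0, 1] for each time-frequency unit.
    Dimensions are placeholders, not the paper's setup.
    """

    def __init__(self, feat_dim=64, mask_dim=64, hidden_dim=256, num_layers=2):
        super().__init__()
        # Recurrent layers capture the temporal dynamics of speech.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.output = nn.Linear(hidden_dim, mask_dim)

    def forward(self, features):
        # features: (batch, time, feat_dim)
        hidden, _ = self.lstm(features)
        # Sigmoid keeps each mask value in [0, 1], as for a ratio mask.
        return torch.sigmoid(self.output(hidden))

# Training against an ideal-ratio-mask target with an MSE loss
# (a common choice in mask-based separation; random tensors stand
# in for real features and targets here).
model = LSTMMaskEstimator()
features = torch.randn(8, 100, 64)   # batch of noisy-speech feature sequences
ideal_mask = torch.rand(8, 100, 64)  # placeholder IRM targets
loss = nn.functional.mse_loss(model(features), ideal_mask)
loss.backward()
```

At test time, the estimated mask would be applied to the time-frequency representation of the noisy mixture to resynthesize enhanced speech; the details of features, mask definition, and resynthesis are given in the paper itself.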


DOI: 10.21437/Interspeech.2016-551

Cite as

Chen, J., Wang, D. (2016) Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation. Proc. Interspeech 2016, 3314-3318.

BibTeX
@inproceedings{Chen+2016,
  author={Jitong Chen and DeLiang Wang},
  title={Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-551},
  url={http://dx.doi.org/10.21437/Interspeech.2016-551},
  pages={3314--3318}
}