Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation

Jing Shi, Jiaming Xu, Bo Xu


Recent deep learning methods have achieved noteworthy success in the multi-talker mixed speech separation task, famously known as the Cocktail Party Problem. However, most existing models are designed for specific predefined conditions, which makes them unable to handle complex auditory scenes automatically, such as a variable and unknown number of speakers in the mixture. In this paper, we propose a speaker-inferred model, based on the flexible and efficient Seq2Seq generation framework, to accurately infer the possible speakers and the speech channel of each. Our model is fully end-to-end, with several modules designed to emphasize and better exploit speaker information. Without a priori knowledge of the number of speakers, and without any additional curriculum training strategy or hand-crafted rules, our method achieves performance comparable to strong baselines.
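The key idea in the abstract, a Seq2Seq decoder that emits one separated channel per inferred speaker and decides on its own when to stop, can be illustrated with a toy sketch. This is not the authors' code: the function names, the callback interface, and the stop rule below are hypothetical stand-ins for the paper's learned decoder.

```python
# Toy sketch (hypothetical, not the paper's implementation): autoregressively
# decode speaker channels from a mixture until the model signals "no more
# speakers", so the speaker count never needs to be given in advance.

def infer_speaker_channels(mixture, decode_step, max_speakers=5):
    """Emit separated channels one at a time.

    `decode_step(mixture, emitted)` stands in for one learned decoding step:
    it returns a separated channel (a list of floats), or None to signal
    that no further speakers are present in the mixture.
    """
    channels = []
    for _ in range(max_speakers):       # hard cap as a safety net
        channel = decode_step(mixture, channels)
        if channel is None:             # decoder inferred there are no more speakers
            break
        channels.append(channel)
    return channels


# A stand-in decode step for demonstration only: pretend the mixture is the
# element-wise sum of two fixed sources and "separate" them one per step.
def demo_step(mixture, emitted):
    sources = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
    if len(emitted) >= len(sources):
        return None
    return sources[len(emitted)]


mix = [1.0, 1.0, 1.0]
out = infer_speaker_channels(mix, demo_step)
print(len(out))  # the loop stops after two channels for this toy mixture
```

Because the loop terminates on the decoder's own stop signal rather than a fixed output count, the same procedure handles mixtures with any number of talkers up to the safety cap.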


 DOI: 10.21437/Interspeech.2019-1591

Cite as: Shi, J., Xu, J., Xu, B. (2019) Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation. Proc. Interspeech 2019, 4609-4613, DOI: 10.21437/Interspeech.2019-1591.


@inproceedings{Shi2019,
  author={Jing Shi and Jiaming Xu and Bo Xu},
  title={{Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4609--4613},
  doi={10.21437/Interspeech.2019-1591},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1591}
}