Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming

Lu Yin, Ziteng Wang, Risheng Xia, Junfeng Li, Yonghong Yan


The recently proposed Permutation Invariant Training (PIT) technique addresses the label permutation problem for multi-talker speech separation. It has been shown to be effective in the single-channel separation case. In this paper, we propose to extend the PIT-based technique to the multichannel multi-talker speech separation scenario. PIT is used to train a neural network that outputs a mask for each speaker, followed by a Minimum Variance Distortionless Response (MVDR) beamformer. The beamformer exploits the spatial information of the different speakers and alleviates the performance degradation caused by misaligned labels. Experimental results show that the proposed PIT-MVDR-based technique leads to higher Signal-to-Distortion Ratios (SDRs) than the single-channel speech separation method when tested on two-speaker and three-speaker mixtures.
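The core idea of PIT, as described in the abstract, is to resolve the label permutation problem by evaluating the training loss under every possible speaker-to-output assignment and back-propagating only the smallest one. A minimal sketch of such a permutation-invariant loss (a generic MSE version, not the paper's exact network or objective; `pit_mse_loss` and its array shapes are illustrative assumptions):

```python
import itertools
import numpy as np

def pit_mse_loss(estimates, targets):
    """Permutation invariant MSE loss (illustrative sketch).

    Tries every assignment of network outputs to reference speakers
    and keeps the assignment with the lowest mean-squared error.

    estimates, targets: arrays of shape (num_speakers, num_frames).
    Returns (best_loss, best_permutation).
    """
    num_spk = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_spk)):
        # Reorder the estimates according to this candidate assignment
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the loss enumerates all permutations, its cost grows factorially with the number of speakers, which is tractable for the two- and three-speaker mixtures considered here.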
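The MVDR beamformer mentioned in the abstract minimizes output noise power subject to a distortionless constraint toward the target speaker, with the standard closed-form weights w = R_n^{-1} d / (d^H R_n^{-1} d). A minimal per-frequency-bin sketch (the covariance estimation and mask-based steering used in the paper are not reproduced here; the function name and inputs are assumptions):

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """MVDR beamformer weights for a single frequency bin.

    noise_cov: (num_mics, num_mics) complex noise covariance R_n.
    steering:  (num_mics,) complex steering vector d toward the target.
    Returns w = R_n^{-1} d / (d^H R_n^{-1} d).
    """
    # Solve R_n x = d instead of forming the explicit inverse
    rn_inv_d = np.linalg.solve(noise_cov, steering)
    return rn_inv_d / (steering.conj() @ rn_inv_d)
```

The normalization enforces the distortionless constraint w^H d = 1, so the target signal passes through unchanged while noise from other directions is suppressed.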


DOI: 10.21437/Interspeech.2018-1739

Cite as: Yin, L., Wang, Z., Xia, R., Li, J., Yan, Y. (2018) Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming. Proc. Interspeech 2018, 851-855, DOI: 10.21437/Interspeech.2018-1739.


@inproceedings{Yin2018,
  author={Lu Yin and Ziteng Wang and Risheng Xia and Junfeng Li and Yonghong Yan},
  title={Multi-talker Speech Separation Based on Permutation Invariant Training and Beamforming},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={851--855},
  doi={10.21437/Interspeech.2018-1739},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1739}
}