Target Speaker Recovery and Recognition Network with Average x-Vector and Global Training

Wenjie Li, Pengyuan Zhang, Yonghong Yan


Multi-talker automatic speech recognition (ASR) remains very challenging. Speaker-aware selective methods have been proposed to recover the speech of the target speaker, relying on auxiliary speaker information provided by an anchor (a clean audio sample of the target speaker). However, their performance is unstable and depends on the quality of the provided anchors. To address this limitation, we propose to take advantage of average speaker embeddings to build the target speaker recovery network (TRnet). The TRnet takes the mixed speech and the stable average speaker embeddings as input and produces time-frequency (TF) masks for the target speech. During training of the TRnet, we average the speaker embeddings over the whole training dataset for each speaker, instead of extracting them from a randomly picked anchor. At test time, one or very few anchors are enough to obtain decent recovery results. The TRnet trained with average speaker embeddings achieves 13% and 12.5% relative improvements in WER and SDR, respectively, compared with the model trained with short anchors. Moreover, to mitigate the mismatch between the TRnet and the acoustic model (AM), we adopt two strategies: fine-tuning the AM and training a global TRnet. Both bring considerable reductions in WER, and the results show that the globally trained framework achieves superior performance.
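
As a rough illustration of the averaging idea (a minimal sketch, not taken from the paper), the snippet below averages the x-vectors of all training utterances of a speaker and uses the averaged embedding to condition mask estimation; extract_xvector and trnet are hypothetical placeholders for a real x-vector extractor and the TRnet mask-estimation network.

    # Minimal sketch (not the authors' code). Assumes:
    #   extract_xvector(utterance) -> np.ndarray of shape (D,)   [hypothetical]
    #   trnet(mix_spec, avg_xvec)  -> TF mask in [0, 1], same shape as mix_spec  [hypothetical]
    import numpy as np

    def average_xvector(utterances, extract_xvector):
        """Average the x-vectors of all training utterances of one speaker."""
        xvecs = np.stack([extract_xvector(u) for u in utterances])  # (N, D)
        return xvecs.mean(axis=0)                                   # (D,)

    def recover_target(mix_spec, avg_xvec, trnet):
        """Estimate a TF mask from the mixture spectrogram and the averaged
        speaker embedding, then apply it to recover the target speech."""
        mask = trnet(mix_spec, avg_xvec)   # soft TF mask for the target speaker
        return mask * mix_spec             # masked (recovered) target spectrogram

At test time, the same recover_target call would be used, but with the embedding averaged over one or a few anchors rather than the whole training set.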


DOI: 10.21437/Interspeech.2019-1692

Cite as: Li, W., Zhang, P., Yan, Y. (2019) Target Speaker Recovery and Recognition Network with Average x-Vector and Global Training. Proc. Interspeech 2019, 3233-3237, DOI: 10.21437/Interspeech.2019-1692.


@inproceedings{Li2019,
  author={Wenjie Li and Pengyuan Zhang and Yonghong Yan},
  title={{Target Speaker Recovery and Recognition Network with Average x-Vector and Global Training}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3233--3237},
  doi={10.21437/Interspeech.2019-1692},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1692}
}