The Tencent speech synthesis system for Blizzard Challenge 2020

Qiao Tian, Zewang Zhang, Ling-Hui Chen, Heng Lu, Chengzhu Yu, Chao Weng, Dong Yu


This paper presents the Tencent speech synthesis system for Blizzard Challenge 2020. The corpus released to the participants this year included a TV news broadcasting corpus of around 8 hours read by a Chinese male host (2020-MH1 task), and a Shanghainese speech corpus of around 6 hours (2020-SS1 task). We built a DurIAN-based speech synthesis system for the 2020-MH1 task and a Tacotron-based system for the 2020-SS1 task. For the 2020-MH1 task, a multi-speaker DurIAN-based acoustic model was first trained on linguistic features to predict mel spectrograms; the model was then fine-tuned on the provided corpus only. For the 2020-SS1 task, instead of training on hard-aligned phone boundaries, a Tacotron-like end-to-end system is applied to learn the mapping between phonemes and mel spectrograms. Finally, a modified version of the WaveRNN model, conditioned on the predicted mel spectrograms, is trained to generate the speech waveform. Our team is identified as L, and the evaluation results show that our systems perform very well across the various tests. In particular, we took first place in the overall speech intelligibility test.


DOI: 10.21437/VCC_BC.2020-4

Cite as: Tian, Q., Zhang, Z., Chen, L., Lu, H., Yu, C., Weng, C., Yu, D. (2020) The Tencent speech synthesis system for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 28-32, DOI: 10.21437/VCC_BC.2020-4.


@inproceedings{Tian2020,
  author={Qiao Tian and Zewang Zhang and Ling-Hui Chen and Heng Lu and Chengzhu Yu and Chao Weng and Dong Yu},
  title={{The Tencent speech synthesis system for Blizzard Challenge 2020}},
  year={2020},
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={28--32},
  doi={10.21437/VCC_BC.2020-4},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-4}
}