This paper presents the Tencent speech synthesis system for the Blizzard Challenge 2020. The corpus released to participants this year included a TV news broadcasting corpus of around 8 hours from a Chinese male host (task 2020-MH1) and a Shanghainese speech corpus of around 6 hours (task 2020-SS1). We built a DurIAN-based speech synthesis system for the 2020-MH1 task and a Tacotron-based system for the 2020-SS1 task. For the 2020-MH1 task, a multi-speaker DurIAN-based acoustic model was first trained on linguistic features to predict mel spectrograms, and then fine-tuned on the provided corpus alone. For the 2020-SS1 task, instead of training on hard-aligned phone boundaries, a Tacotron-like end-to-end system was applied to learn the mapping between phonemes and mel spectrograms. Finally, a modified WaveRNN model conditioned on the predicted mel spectrograms was trained to generate the speech waveform. Our team is identified as L, and the evaluation results show that our systems performed very well across the various tests; in particular, we took first place in the overall speech intelligibility test.
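Both systems described above predict mel spectrograms as the intermediate acoustic representation on which the WaveRNN vocoder is conditioned. As a minimal illustration of that feature (not the authors' actual front end; frame size, hop, sample rate, and 80 mel bands are common but assumed values here), the following NumPy sketch computes a log-mel spectrogram from a waveform:

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centers evenly spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            if center > left:
                fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            if right > center:
                fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(wav, sr=24000, n_fft=1024, hop=256, n_mels=80):
    # Frame the waveform, apply a Hann window, take the power spectrum,
    # then project onto the mel filterbank and compress with a log
    frames = []
    for start in range(0, max(1, len(wav) - n_fft + 1), hop):
        frame = wav[start:start + n_fft] * np.hanning(n_fft)
        frames.append(np.abs(np.fft.rfft(frame)) ** 2)
    power = np.stack(frames, axis=1)                   # (n_fft//2+1, T)
    mel = mel_filterbank(n_mels, n_fft, sr) @ power    # (n_mels, T)
    return np.log(np.maximum(mel, 1e-10))              # log-mel features
```

In a pipeline like the one described, features of this kind are extracted from the training audio as acoustic-model targets, and the vocoder learns to invert them back to a waveform.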
Cite as: Tian, Q., Zhang, Z., Chen, L.-H., Lu, H., Yu, C., Weng, C., Yu, D. (2020) The Tencent speech synthesis system for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 28-32, doi: 10.21437/VCCBC.2020-4
@inproceedings{tian20_vccbc,
  author={Qiao Tian and Zewang Zhang and Ling-Hui Chen and Heng Lu and Chengzhu Yu and Chao Weng and Dong Yu},
  title={{The Tencent speech synthesis system for Blizzard Challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={28--32},
  doi={10.21437/VCCBC.2020-4}
}