In this paper, we introduce the text-to-speech system that the Sogou team submitted to Blizzard Challenge 2020. The goal of this year’s challenge is to build a natural Mandarin Chinese speech synthesis system from a 10-hour corpus recorded by a native Chinese male speaker. We discuss the major modules of the submitted system: (1) the front-end module, which analyzes the pronunciation and prosody of the text; (2) the FastSpeech-based sequence-to-sequence acoustic model, which predicts acoustic features; and (3) the WaveRNN-based neural vocoder, which reconstructs waveforms. We also discuss the evaluation results provided by the challenge organizer.
Cite as: Meng, F., Wang, R., Fang, P., Zou, S., Duan, W., Zhou, M., Liu, K., Chen, W. (2020) The Sogou System for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 49-53, doi: 10.21437/VCCBC.2020-8
@inproceedings{meng20_vccbc,
  author={Fanbo Meng and Ruimin Wang and Peng Fang and Shuangyuan Zou and Wenjun Duan and Ming Zhou and Kai Liu and Wei Chen},
  title={{The Sogou System for Blizzard Challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={49--53},
  doi={10.21437/VCCBC.2020-8}
}