This paper introduces the SHNU (team I) speech synthesis system for the Blizzard Challenge 2020. The speech data released this year consist of two parts: a 9.5-hour Mandarin corpus from a male native speaker and a 3-hour Shanghainese corpus from a female native speaker. Based on these corpora, we built two neural network-based speech synthesis systems, one for each task, using the same architecture for both the Mandarin and Shanghainese tasks. Specifically, each system comprises a front-end module, a Tacotron-based spectrogram prediction network and a WaveNet-based neural vocoder. First, a pre-built front-end module was used to generate character sequences and linguistic features from the training text. Then, a Tacotron-based sequence-to-sequence model was applied to generate mel-spectrograms from the character sequences. Finally, a WaveNet-based neural vocoder was adopted to reconstruct the audio waveform from the mel-spectrograms predicted by Tacotron. Evaluation results show that our system achieved very good performance on both tasks, demonstrating the effectiveness of the proposed system.
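To make the three-stage pipeline described in the abstract concrete, the following is a minimal illustrative sketch in Python of how text flows through a front-end, a Tacotron-style acoustic model, and a WaveNet-style vocoder. All class names (TextFrontEnd, TacotronModel, WaveNetVocoder) and their interfaces are hypothetical placeholders for exposition only; they are not the authors' implementation, and the model bodies are stubbed out.

```python
# Illustrative sketch of the pipeline: front-end -> Tacotron-style acoustic
# model -> WaveNet-style vocoder. All names here are hypothetical placeholders.
import numpy as np


class TextFrontEnd:
    """Hypothetical front-end producing a character sequence from raw text."""

    def process(self, text: str) -> list:
        # A real front-end would also handle Mandarin/Shanghainese text
        # normalization, polyphone disambiguation and prosody labelling.
        return list(text.strip())


class TacotronModel:
    """Hypothetical sequence-to-sequence model: characters -> mel-spectrogram."""

    def __init__(self, n_mels: int = 80):
        self.n_mels = n_mels

    def predict_mel(self, chars: list) -> np.ndarray:
        # Placeholder: a real model runs an attention-based encoder-decoder.
        n_frames = 10 * max(len(chars), 1)
        return np.zeros((n_frames, self.n_mels), dtype=np.float32)


class WaveNetVocoder:
    """Hypothetical neural vocoder: mel-spectrogram -> waveform."""

    def __init__(self, hop_length: int = 256):
        self.hop_length = hop_length

    def generate(self, mel: np.ndarray) -> np.ndarray:
        # Placeholder: a real vocoder samples audio conditioned on the mel.
        return np.zeros(mel.shape[0] * self.hop_length, dtype=np.float32)


def synthesize(text: str) -> np.ndarray:
    """Run the full text-to-speech pipeline on one input sentence."""
    chars = TextFrontEnd().process(text)
    mel = TacotronModel().predict_mel(chars)
    return WaveNetVocoder().generate(mel)


if __name__ == "__main__":
    audio = synthesize("你好，世界")
    print(f"Generated {audio.shape[0]} samples")
```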
Cite as: He, L., Shi, Q., Wu, L., Sun, J., He, R., Long, Y., Liang, J. (2020) The SHNU System for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 19-23, doi: 10.21437/VCCBC.2020-2
@inproceedings{he20_vccbc,
  author={Laipeng He and Qiang Shi and Lang Wu and Jianqing Sun and Renke He and Yanhua Long and Jiaen Liang},
  title={{The SHNU System for Blizzard Challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={19--23},
  doi={10.21437/VCCBC.2020-2}
}