Submission from SCUT for Blizzard Challenge 2020

Yitao Yang, Jinghui Zhong, Shehui Bu


In this paper, we describe the SCUT text-to-speech synthesis system for the Blizzard Challenge 2020, where the task is to build a voice from the provided Mandarin dataset. We begin with our system architecture, which consists of an end-to-end model that converts textual sequences into acoustic features and a WaveRNN vocoder that reconstructs the waveform. We then introduce a BERT-based prosody prediction model that specifies the prosodic information of the content. The text processing module is adjusted to encode Mandarin and English texts uniformly, and a two-stage training method is used to build a bilingual speech synthesis system. We also employ forward attention and guided attention mechanisms to accelerate the model’s convergence. Finally, we discuss the reasons for our system’s unsatisfactory performance in the evaluation results.
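The abstract does not state how the guided attention term is formulated. As an illustration only, the sketch below assumes the commonly used penalty matrix W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), which is near zero along the diagonal of the text-to-frame alignment and approaches one far from it. The function names, the list-of-lists attention layout, and the default g = 0.2 are assumptions for this example and are not taken from the paper.

```python
import math


def guided_attention_weights(num_text, num_frames, g=0.2):
    """Build a num_text x num_frames penalty matrix.

    Entries are 0 where text position and frame position are
    proportionally aligned (the diagonal) and grow toward 1 away from it.
    The width parameter g is a hypothetical default.
    """
    weights = []
    for n in range(num_text):
        row = []
        for t in range(num_frames):
            diff = n / num_text - t / num_frames
            row.append(1.0 - math.exp(-(diff ** 2) / (2.0 * g * g)))
        weights.append(row)
    return weights


def guided_attention_loss(attention, g=0.2):
    """Average of attention * penalty over all (text, frame) pairs.

    `attention` is a list of rows indexed as attention[n][t].
    """
    num_text = len(attention)
    num_frames = len(attention[0])
    penalty = guided_attention_weights(num_text, num_frames, g)
    total = 0.0
    for n in range(num_text):
        for t in range(num_frames):
            total += attention[n][t] * penalty[n][t]
    return total / (num_text * num_frames)


# A perfectly diagonal alignment incurs no penalty,
# while a reversed alignment is penalized.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
reversed_alignment = [[1.0 if n + t == 3 else 0.0 for t in range(4)] for n in range(4)]
print(guided_attention_loss(diagonal))
print(guided_attention_loss(reversed_alignment))
```

Adding such a term to the training loss pushes attention mass toward a roughly monotonic, diagonal alignment early in training, which is one plausible reading of how guided attention speeds up convergence.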


DOI: 10.21437/VCC_BC.2020-6

Cite as: Yang, Y., Zhong, J., Bu, S. (2020) Submission from SCUT for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 38-43, DOI: 10.21437/VCC_BC.2020-6.


@inproceedings{Yang2020,
  author={Yitao Yang and Jinghui Zhong and Shehui Bu},
  title={{Submission from SCUT for Blizzard Challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={38--43},
  doi={10.21437/VCC_BC.2020-6},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-6}
}