The Ximalaya TTS System for Blizzard Challenge 2020

Zhiba Su, Wendi He, Yang Sun


This paper describes the Ximalaya text-to-speech synthesis system built for the Blizzard Challenge 2020. The two tasks are to build expressive speech synthesizers from the released 9.5-hour Mandarin corpus of a male native speaker and the 3-hour Shanghainese corpus of a female native speaker, respectively. Our architecture is a Tacotron2-based acoustic model with a WaveRNN vocoder. Several methods for preprocessing and checking the raw BC transcripts are implemented. First, the multi-task TTS front-end module transforms text sequences into phoneme-level sequences with prosody labels after performing polyphone disambiguation and prosody prediction. Then, we train a Seq2seq multi-speaker acoustic model on the released corpus for Mel spectrogram modeling. Finally, a WaveRNN neural vocoder with minor improvements generates high-quality audio for the submitted results. The identifier for our system is M, and the evaluation results from the listening tests show that our submitted system performed well on most of the criteria.
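The three stages named in the abstract (front-end, acoustic model, vocoder) can be pictured as a simple function chain. The following is only a minimal sketch of that data flow, not the authors' implementation: every function body is a stand-in stub, and the toy lexicon, the `#4` prosody label, the 80 Mel bins, the 5 frames per token, and the hop length of 256 are all illustrative assumptions that do not come from the paper.

```python
from typing import List

# Hypothetical toy lexicon standing in for a trained front-end model.
TOY_LEXICON = {"你": "ni3", "好": "hao3"}


def front_end(text: str) -> List[str]:
    """Map characters to phoneme tokens and append a prosody label.

    A real front-end would run polyphone disambiguation and prosody
    prediction; here '#4' (assumed label) just marks the utterance end.
    """
    phonemes = [TOY_LEXICON.get(ch, "<unk>") for ch in text]
    return phonemes + ["#4"]


def acoustic_model(tokens: List[str], speaker_id: int,
                   n_mels: int = 80) -> List[List[float]]:
    """Stub for a Tacotron2-style multi-speaker seq2seq model.

    Emits a fixed number of silent Mel frames per input token
    (speaker_id is accepted but unused in this stub).
    """
    frames_per_token = 5  # assumed constant, for illustration only
    return [[0.0] * n_mels for _ in range(len(tokens) * frames_per_token)]


def vocoder(mel: List[List[float]], hop_length: int = 256) -> List[float]:
    """Stub for a WaveRNN-style vocoder: hop_length samples per Mel frame."""
    return [0.0] * (len(mel) * hop_length)


def synthesize(text: str, speaker_id: int = 0) -> List[float]:
    """Chain the three stages: text -> tokens -> Mel frames -> waveform."""
    tokens = front_end(text)
    mel = acoustic_model(tokens, speaker_id)
    return vocoder(mel)
```

Calling `synthesize("你好")` runs the text through all three stubs in order; the point of the sketch is only the interface between stages (phoneme-level tokens with prosody labels, then Mel frames, then audio samples).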


DOI: 10.21437/VCC_BC.2020-10

Cite as: Su, Z., He, W., Sun, Y. (2020) The Ximalaya TTS System for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 59-63, DOI: 10.21437/VCC_BC.2020-10.


@inproceedings{Su2020,
  author={Zhiba Su and Wendi He and Yang Sun},
  title={{The Ximalaya TTS System for Blizzard Challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={59--63},
  doi={10.21437/VCC_BC.2020-10},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-10}
}