The HITSZ TTS system for Blizzard challenge 2020

Huhao Fu, Yiben Zhang, Kailong Liu, Chao Liu


In this paper, we present the techniques used in the HITSZ-TTS entry to the Blizzard Challenge 2020. The corpus released to participants this year consists of about 10 hours of speech recorded by a Chinese male speaker, mixing Mandarin and English. For this task, we built an end-to-end speech synthesis system made up of the following parts: (1) a front-end module that analyzes the pronunciation and prosody of the text; (2) a phoneme conversion tool; (3) a forward-attention-based sequence-to-sequence acoustic model, jointly trained with prosody labels, that predicts 80-dimensional Mel-spectrograms; and (4) a Parallel WaveGAN-based neural vocoder that reconstructs the waveforms. This is our first time participating in the Blizzard Challenge, and our system's identifier is G. The results of the subjective listening tests show that the proposed system's performance was unsatisfactory. The problems with the system are also discussed in this paper.
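The abstract does not give the acoustic model's equations. As a rough illustration only, the sketch below implements the generic forward-attention recursion for sequence-to-sequence speech synthesis: the previous step's alignment alpha_{t-1} is shifted by at most one encoder position, multiplied by the ordinary attention probabilities y_t, and renormalized. The function names, the use of plain Python lists, and the omission of the optional transition agent are assumptions for this sketch, not details of the HITSZ system.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over attention energies."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward_attention_step(prev_alpha: Sequence[float],
                           energies: Sequence[float]) -> List[float]:
    """One decoder step of forward attention (hypothetical helper, no transition agent).

    alpha_hat_t(n) = (alpha_{t-1}(n) + alpha_{t-1}(n-1)) * y_t(n)
    alpha_t(n)     = alpha_hat_t(n) / sum_m alpha_hat_t(m)
    """
    y = softmax(energies)  # standard content-based attention probabilities
    alpha_hat = [
        (prev_alpha[n] + (prev_alpha[n - 1] if n > 0 else 0.0)) * y[n]
        for n in range(len(y))
    ]
    total = sum(alpha_hat)  # assumed non-zero (softmax outputs are positive)
    return [a / total for a in alpha_hat]


def context_vector(alpha: Sequence[float],
                   encoder_states: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of encoder states using the forward-attention weights."""
    dim = len(encoder_states[0])
    return [sum(alpha[n] * encoder_states[n][d] for n in range(len(alpha)))
            for d in range(dim)]


# Usage: start fully aligned to the first encoder position.
alpha = [1.0, 0.0, 0.0, 0.0]
alpha = forward_attention_step(alpha, [0.0, 0.0, 0.0, 0.0])  # [0.5, 0.5, 0.0, 0.0]
alpha = forward_attention_step(alpha, [0.0, 0.0, 0.0, 0.0])  # [0.25, 0.5, 0.25, 0.0]
```

Because alpha_t(n) only receives mass from alpha_{t-1}(n) and alpha_{t-1}(n-1), the alignment can advance by at most one encoder position per decoder step, which encourages the monotonic, left-to-right alignments that speech synthesis requires.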


DOI: 10.21437/VCC_BC.2020-11

Cite as: Fu, H., Zhang, Y., Liu, K., Liu, C. (2020) The HITSZ TTS system for Blizzard challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 64-69, DOI: 10.21437/VCC_BC.2020-11.


@inproceedings{Fu2020,
  author={Huhao Fu and Yiben Zhang and Kailong Liu and Chao Liu},
  title={{The HITSZ TTS system for Blizzard challenge 2020}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={64--69},
  doi={10.21437/VCC_BC.2020-11},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-11}
}