The Duke Entry for 2020 Blizzard Challenge

Zexin Cai, Ming Li


This paper presents the speech synthesis system built for the 2020 Blizzard Challenge by team ‘H’. The goal of the challenge is to build a synthesizer that generates high-fidelity speech in a voice similar to that of the provided data. Our system draws mainly on end-to-end neural networks. Specifically, an encoder-decoder prosody prediction network inserts prosodic annotations into a given sentence. We use the spectrogram predictor from Tacotron2 as the end-to-end phoneme-to-spectrogram generator, followed by the neural vocoder WaveRNN, which converts the predicted spectrograms to audio signals. Additionally, we apply fine-tuning strategies to improve TTS performance. The synthetic audio is evaluated subjectively for naturalness, similarity, and intelligibility. Samples are available online for listening.
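
The three-stage structure named in the abstract (prosody annotation, spectrogram prediction, vocoding) can be illustrated with a minimal Python sketch. It shows only how the stages chain together. Every function name, the `#` boundary mark, the 80 mel bins, and the hop size of 256 are assumptions made for illustration; the stubs stand in for the trained networks and are not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical type aliases, chosen for illustration only.
Tokens = List[str]
Spectrogram = List[List[float]]  # frames x mel bins
Waveform = List[float]


def synthesize(
    text: str,
    annotate_prosody: Callable[[str], Tokens],
    predict_spectrogram: Callable[[Tokens], Spectrogram],
    vocode: Callable[[Spectrogram], Waveform],
) -> Waveform:
    """Chain the three stages: text -> annotated tokens -> spectrogram -> audio."""
    annotated = annotate_prosody(text)
    mel = predict_spectrogram(annotated)
    return vocode(mel)


def stub_prosody(text: str) -> Tokens:
    # Stand-in for the prosody prediction network: append an assumed
    # boundary mark "#" after every whitespace-separated word.
    tokens: Tokens = []
    for word in text.split():
        tokens.extend([word, "#"])
    return tokens


def stub_spectrogram(tokens: Tokens, n_mels: int = 80) -> Spectrogram:
    # Stand-in for the Tacotron2 spectrogram predictor: one silent frame per token.
    return [[0.0] * n_mels for _ in tokens]


def stub_vocoder(mel: Spectrogram, hop: int = 256) -> Waveform:
    # Stand-in for the WaveRNN vocoder: emit `hop` silent samples per frame.
    return [0.0] * (len(mel) * hop)


wave = synthesize("hello world", stub_prosody, stub_spectrogram, stub_vocoder)
```

Passing the stages in as callables keeps the sketch independent of any particular model code, so each stub could be swapped for a real network without changing `synthesize`.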


DOI: 10.21437/VCC_BC.2020-5

Cite as: Cai, Z., Li, M. (2020) The Duke Entry for 2020 Blizzard Challenge. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 33-37, DOI: 10.21437/VCC_BC.2020-5.


@inproceedings{Cai2020,
  author={Zexin Cai and Ming Li},
  title={{The Duke Entry for 2020 Blizzard Challenge}},
  year=2020,
  booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020},
  pages={33--37},
  doi={10.21437/VCC_BC.2020-5},
  url={http://dx.doi.org/10.21437/VCC_BC.2020-5}
}