This paper presents the RoyalFlush synthesis system for Blizzard Challenge 2020. The two required voices are built from the released Mandarin and Shanghainese data. The system is based on end-to-end speech synthesis and introduces several improvements over our system from last year. First, a Mandarin front-end is employed to transform input text into a phoneme sequence with prosody labels. Second, to improve speech stability, a modified Tacotron acoustic model is proposed. Third, we apply a GMM-based attention mechanism for robust long-form speech synthesis. Finally, a lightweight LPCNet-based neural vocoder is adopted to achieve a good trade-off between effectiveness and efficiency. Among all the participating teams of the Challenge, the identifier for our system is N. Evaluation results demonstrate that our system performs relatively well in intelligibility, but it still needs to be improved in terms of naturalness and similarity.
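To illustrate why a GMM-based attention mechanism tends to be robust on long inputs, the following is a minimal sketch of one generic variant, not the configuration used in the RoyalFlush system, whose details the abstract does not give. Each mixture component's mean is advanced by a strictly positive step at every decoder step, so the attention window can only move forward over the encoder sequence. The function names, the parameter layout (`params` with three raw values per component), the softmax and exponential parameterisation, and the random stand-in for the network outputs are all assumptions made for illustration.

```python
import numpy as np


def gmm_attention_step(params, prev_means, num_encoder_steps):
    """One decoder step of a generic GMM attention (illustrative only).

    params: array of shape (K, 3) holding raw, unconstrained values
            [weight, step, width] for each of K mixture components.
            (Hypothetical layout; in a real model these come from a
            small network conditioned on the decoder state.)
    prev_means: array of shape (K,), component means from the previous step.
    Returns (alignment over encoder steps, updated means).
    """
    w_hat, delta_hat, sigma_hat = params[:, 0], params[:, 1], params[:, 2]

    # Mixture weights: softmax over components.
    w = np.exp(w_hat - w_hat.max())
    w /= w.sum()

    # exp() keeps the step strictly positive, so means only move forward.
    delta = np.exp(delta_hat)
    sigma = np.exp(sigma_hat)
    means = prev_means + delta

    # Evaluate each Gaussian bump at every encoder position and sum them.
    positions = np.arange(num_encoder_steps)[None, :]            # (1, T)
    z = (positions - means[:, None]) / sigma[:, None]             # (K, T)
    alignment = (w[:, None] * np.exp(-0.5 * z ** 2)).sum(axis=0)  # (T,)

    # Note: the alignment is left unnormalised in this sketch.
    return alignment, means


def run_demo(num_decoder_steps=5, num_encoder_steps=20, num_components=3, seed=0):
    """Run a few steps with random stand-in parameters and record the means."""
    rng = np.random.default_rng(seed)
    means = np.zeros(num_components)
    history = []
    alignment = None
    for _ in range(num_decoder_steps):
        params = rng.normal(size=(num_components, 3))  # stand-in for network output
        alignment, means = gmm_attention_step(params, means, num_encoder_steps)
        history.append(means.copy())
    return np.array(history), alignment


if __name__ == "__main__":
    history, alignment = run_demo()
    print("component means per decoder step:\n", history)
    print("last alignment peaks at encoder index", int(alignment.argmax()))
```

Because the means never decrease, this style of attention cannot jump backwards or repeat earlier input, which is one commonly cited reason for preferring it over content-based attention on long utterances.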
Cite as: Lu, J., Lu, Z., He, T., Zhang, P., Hu, X., Xu, X. (2020) The RoyalFlush Synthesis System for Blizzard Challenge 2020. Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 54-58, doi: 10.21437/VCCBC.2020-9
@inproceedings{lu20_vccbc, author={Jian Lu and Zeru Lu and Ting He and Peng Zhang and Xinhui Hu and Xinkang Xu}, title={{The RoyalFlush Synthesis System for Blizzard Challenge 2020}}, year=2020, booktitle={Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020}, pages={54--58}, doi={10.21437/VCCBC.2020-9} }