Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis

Bing Yang, Jiaqi Zhong, Shan Liu


In this paper, we propose a novel method to improve the performance and robustness of the front-end text processing modules of Mandarin text-to-speech (TTS) synthesis. We use pre-trained text encoding models, such as the encoder of a Transformer-based neural machine translation (NMT) model and BERT, to extract latent semantic representations of words or characters, which then serve as input features for the front-end tasks of TTS systems. Experiments on Mandarin polyphone disambiguation and prosodic structure prediction show that the proposed method yields significant improvements: absolute gains of 0.013 and 0.027 in F1 score for prosodic word prediction and prosodic phrase prediction, respectively, and an absolute improvement of 2.44% in polyphone disambiguation over previous methods.
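The paper itself does not include code, but the core idea of using a pre-trained text encoder as a feature extractor for a front-end task can be illustrated with a minimal sketch. The snippet below assumes the Hugging Face transformers library and a character-level Chinese BERT (bert-base-chinese); the example sentence, the target polyphonic character, and the linear classifier head are illustrative assumptions, not the architecture described in the paper.

import torch
from transformers import BertTokenizer, BertModel

# Load a pre-trained character-level Chinese BERT as a feature extractor.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")
encoder.eval()  # used here as a frozen feature extractor

sentence = "他去银行办事"  # "行" is polyphonic (hang2 / xing2)

# Chinese BERT tokenizes at the character level, so each character
# maps to one contextual vector in the encoder output.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)
hidden = outputs.last_hidden_state  # shape: (1, seq_len, 768), incl. [CLS]/[SEP]

# Hypothetical downstream head: classify the reading of the polyphonic
# character from its contextual representation.
num_readings = 2  # e.g. hang2 vs xing2 for this character
polyphone_classifier = torch.nn.Linear(encoder.config.hidden_size, num_readings)

char_index = sentence.index("行") + 1  # +1 to skip the [CLS] token
char_repr = hidden[:, char_index, :]   # contextual vector of the target character
logits = polyphone_classifier(char_repr)
print(logits.shape)  # torch.Size([1, 2])

In the paper's setting, such contextual representations are fed as input features to the front-end models for polyphone disambiguation and prosodic structure prediction; the exact task architectures and whether the encoder is fine-tuned follow the paper rather than this sketch.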


DOI: 10.21437/Interspeech.2019-1418

Cite as: Yang, B., Zhong, J., Liu, S. (2019) Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis. Proc. Interspeech 2019, 4480-4484, DOI: 10.21437/Interspeech.2019-1418.


@inproceedings{Yang2019,
  author={Bing Yang and Jiaqi Zhong and Shan Liu},
  title={{Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4480--4484},
  doi={10.21437/Interspeech.2019-1418},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1418}
}