Boosting Character-Based Chinese Speech Synthesis via Multi-Task Learning and Dictionary Tutoring

Yuxiang Zou, Linhao Dong, Bo Xu


Recent character-based end-to-end text-to-speech (TTS) systems have shown promising performance in natural speech generation, especially for English. For Chinese TTS, however, a character-based model is prone to generating speech with wrong pronunciations because of the label sparsity issue. To address this issue, we introduce an additional learning task of character-to-pinyin mapping to boost the pronunciation learning of characters, and leverage a pre-trained dictionary network to correct pronunciation mistakes through joint training. Specifically, our model predicts pinyin labels, where pinyin is a standard phonetic representation for Chinese characters, as an auxiliary task that helps it learn better hidden representations of Chinese characters. The dictionary network acts as a tutor that further helps hidden representation learning. Experiments demonstrate that employing the pinyin auxiliary task and an external dictionary network clearly enhances the naturalness and intelligibility of speech synthesized directly from Chinese character sequences.
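
As a rough illustration of the joint training idea described in the abstract, the sketch below combines a main TTS loss with a pinyin auxiliary loss and a dictionary tutoring loss. It is not the authors' implementation: the function names, the loss weights, the use of cross-entropy for the pinyin task, and the use of mean squared error between the character hidden states and the dictionary network's outputs are all assumptions made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target pinyin class."""
    return -math.log(softmax(logits)[target_index])


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(tts_loss, pinyin_logits, pinyin_targets,
               char_hidden, dict_hidden,
               aux_weight=0.5, tutor_weight=0.5):
    """Combine the three training signals (hypothetical formulation).

    tts_loss       : scalar loss of the main speech synthesis task
    pinyin_logits  : per-character scores over the pinyin vocabulary
    pinyin_targets : per-character index of the correct pinyin label
    char_hidden    : per-character hidden vectors from the TTS encoder
    dict_hidden    : per-character vectors from the pre-trained
                     dictionary network (the "tutor")
    aux_weight, tutor_weight : assumed weights, not taken from the paper
    """
    n = len(pinyin_targets)
    # Auxiliary task: predict the pinyin label of each character.
    aux = sum(cross_entropy(l, t)
              for l, t in zip(pinyin_logits, pinyin_targets)) / n
    # Tutoring: pull encoder states toward the dictionary network's output.
    tutor = sum(mean_squared_error(h, d)
                for h, d in zip(char_hidden, dict_hidden)) / n
    return tts_loss + aux_weight * aux + tutor_weight * tutor


if __name__ == "__main__":
    loss = joint_loss(
        tts_loss=1.0,
        pinyin_logits=[[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]],
        pinyin_targets=[0, 1],
        char_hidden=[[0.2, 0.4], [0.1, 0.9]],
        dict_hidden=[[0.0, 0.5], [0.1, 1.0]],
    )
    print(f"joint loss: {loss:.4f}")
```

In a sketch like this, the two weights control how strongly the auxiliary and tutoring signals shape the character representations relative to the synthesis objective; suitable values would have to be tuned.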


DOI: 10.21437/Interspeech.2019-3233

Cite as: Zou, Y., Dong, L., Xu, B. (2019) Boosting Character-Based Chinese Speech Synthesis via Multi-Task Learning and Dictionary Tutoring. Proc. Interspeech 2019, 2055-2059, DOI: 10.21437/Interspeech.2019-3233.


@inproceedings{Zou2019,
  author={Yuxiang Zou and Linhao Dong and Bo Xu},
  title={{Boosting Character-Based Chinese Speech Synthesis via Multi-Task Learning and Dictionary Tutoring}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2055--2059},
  doi={10.21437/Interspeech.2019-3233},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3233}
}