ISCA Archive Interspeech 2021

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS

Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu

This paper introduces PnG BERT, a new encoder model for neural TTS. The model augments the original BERT model by taking both phoneme and grapheme representations of text as input, along with the word-level alignment between them. It can be pre-trained on a large text corpus in a self-supervised manner, and fine-tuned on a TTS task. Experimental results show that a neural TTS model using a pre-trained PnG BERT as its encoder yields more natural prosody and more accurate pronunciation than a baseline model using only phoneme input with no pre-training. In subjective side-by-side preference evaluations, raters showed no statistically significant preference between speech synthesized using PnG BERT and ground-truth recordings from professional speakers.
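To illustrate the kind of input described above, here is a minimal sketch of how a combined phoneme-plus-grapheme token stream with word-level alignment could be assembled. All names here are illustrative assumptions, not from the paper's implementation; the key idea is that phoneme tokens and grapheme tokens carry a shared word-position ID so the encoder can align them.

```python
def build_png_bert_input(words, phonemes_per_word):
    """Concatenate phoneme and grapheme tokens with shared word IDs.

    words: list of grapheme tokens, one per word (whole words used here
           for simplicity; a real system might use wordpieces).
    phonemes_per_word: list of phoneme lists, aligned one-to-one with words.
    Returns (tokens, segment_ids, word_ids), where word_ids gives each
    token the index of the word it belongs to (0 for special tokens).
    """
    tokens, segment_ids, word_ids = ["[CLS]"], [0], [0]
    # Segment A: phoneme tokens, tagged with their word index.
    for w_idx, phones in enumerate(phonemes_per_word, start=1):
        for p in phones:
            tokens.append(p)
            segment_ids.append(0)
            word_ids.append(w_idx)
    tokens.append("[SEP]")
    segment_ids.append(0)
    word_ids.append(0)
    # Segment B: grapheme tokens, reusing the same word indices so the
    # model sees which phonemes and graphemes belong to the same word.
    for w_idx, w in enumerate(words, start=1):
        tokens.append(w)
        segment_ids.append(1)
        word_ids.append(w_idx)
    tokens.append("[SEP]")
    segment_ids.append(1)
    word_ids.append(0)
    return tokens, segment_ids, word_ids
```

For example, `build_png_bert_input(["hello"], [["HH", "AH", "L", "OW"]])` produces one sequence containing the four phonemes followed by the grapheme token, with both spans sharing word ID 1.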


doi: 10.21437/Interspeech.2021-1757

Cite as: Jia, Y., Zen, H., Shen, J., Zhang, Y., Wu, Y. (2021) PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS. Proc. Interspeech 2021, 151-155, doi: 10.21437/Interspeech.2021-1757

@inproceedings{jia21_interspeech,
  author={Ye Jia and Heiga Zen and Jonathan Shen and Yu Zhang and Yonghui Wu},
  title={{PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={151--155},
  doi={10.21437/Interspeech.2021-1757}
}