A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis

Jason Fong, Jason Taylor, Korin Richmond, Simon King


Neural sequence-to-sequence (S2S) models for text-to-speech synthesis (TTS) may take letter or phone input sequences. Since for many languages phones have a more direct relationship to the acoustic signal, they lead to improved quality. But generating phone transcriptions from text requires an expensive dictionary and an error-prone grapheme-to-phoneme (G2P) model, and the relative improvement over using letters has yet to be quantified. In approaching this question, we presume that letter-input S2S models must implicitly learn an internal counterpart to G2P conversion and therefore inevitably make errors. Such a model may thus be viewed as phone-input S2S with inaccurate phone input. To quantify this inaccuracy, we compare in this paper a letter-input S2S system to several phone-input systems trained on data with a varying level of error in the phonetic transcription. Our findings show our letter-input system is equivalent in quality to the phone-input system in which 25% of word tokens in the training data have incorrect phonetic transcriptions. Furthermore, we find that for phone-input systems up to 15% of word tokens in the training data can have incorrect phonetic transcriptions without any significant difference in performance to a 0% error rate system. This suggests it is acceptable to use G2P to predict pronunciations for out-of-vocabulary words (OOVs) provided they are less than around 15% of the training data, removing the need to manually add OOVs to the dictionary for every new training set.
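As a rough illustration of the two quantities discussed above, the following Python sketch computes the fraction of word tokens in a training set that are missing from a pronunciation dictionary, and builds a transcription in which a chosen fraction of word tokens is deliberately made incorrect. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the toy lexicon, the phone symbols, and in particular the corruption method (substituting a single phone) are all hypothetical, since the abstract does not say how incorrect transcriptions were produced.

```python
import random


def oov_token_rate(tokens, lexicon):
    """Fraction of word tokens whose word is not in the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t.lower() not in lexicon)
    return missing / len(tokens)


def corrupt_tokens(tokens, lexicon, error_rate, phone_set, seed=0):
    """Return one phone list per token, with a fraction made incorrect.

    ASSUMPTION: a transcription is made 'incorrect' by replacing one phone
    with a different phone from phone_set. This is only an illustrative
    choice; the abstract does not specify the corruption method.
    Every token must be in the lexicon, and phone_set needs >= 2 phones.
    """
    rng = random.Random(seed)
    n_bad = round(error_rate * len(tokens))
    bad_idx = set(rng.sample(range(len(tokens)), n_bad))
    out = []
    for i, t in enumerate(tokens):
        phones = list(lexicon[t.lower()])  # copy so the lexicon is untouched
        if i in bad_idx:
            pos = rng.randrange(len(phones))
            choices = [p for p in phone_set if p != phones[pos]]
            phones[pos] = rng.choice(choices)
        out.append(phones)
    return out


# Toy lexicon and phone inventory, invented for this example.
LEXICON = {
    "the": ["dh", "ax"],
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
}
PHONES = ["ae", "ax", "dh", "k", "s", "t"]

if __name__ == "__main__":
    train = ["the", "cat", "sat", "the"]
    print(oov_token_rate(train + ["zyzzyva"], LEXICON))
    print(corrupt_tokens(train, LEXICON, 0.25, PHONES))
```

Under these assumptions, a training set whose `oov_token_rate` is below roughly 0.15 would fall in the range where the abstract suggests G2P-predicted pronunciations are acceptable, although G2P will not get every OOV wrong, so the OOV rate is only an upper bound on the resulting token error rate.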


DOI: 10.21437/SSW.2019-40

Cite as: Fong, J., Taylor, J., Richmond, K., King, S. (2019) A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis. Proc. 10th ISCA Speech Synthesis Workshop, 223-227, DOI: 10.21437/SSW.2019-40.


@inproceedings{Fong2019,
  author={Jason Fong and Jason Taylor and Korin Richmond and Simon King},
  title={{A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={223--227},
  doi={10.21437/SSW.2019-40},
  url={http://dx.doi.org/10.21437/SSW.2019-40}
}