Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis

Kévin Vythelingum, Yannick Estève, Olivier Rosec


The purpose of text-to-speech (TTS) synthesis is to produce a speech signal from an input text. Building a TTS system requires annotating speech recordings with word and phonemic transcriptions, and the overall quality of TTS depends heavily on the accuracy of the phonemic transcriptions. However, these transcriptions are generally produced automatically by grapheme-to-phoneme conversion systems, which do not account for speaker variability. In this work, we explore ways to obtain signal-dependent phonemic transcriptions. We investigate forced alignment with an enriched pronunciation lexicon and multimodal phonemic transcription. We then apply our results to error detection in grapheme-to-phoneme conversion hypotheses, in order to find where the phonemic transcriptions may be erroneous. On a French TTS dataset, we show that we can detect up to 90.5% of the errors of a state-of-the-art grapheme-to-phoneme conversion system by marking less than 15.8% of phonemes as erroneous. This can help a human annotator correct most grapheme-to-phoneme conversion errors without checking a large amount of data. In other words, our method can significantly reduce the cost of creating high-quality TTS data.
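The two figures reported above (share of errors detected, share of phonemes marked) can be illustrated with a minimal sketch. It is not the authors' method: it simply assumes that a phoneme of the grapheme-to-phoneme hypothesis is marked as suspect whenever it disagrees with a signal-dependent transcription, and it uses a generic sequence alignment from Python's standard library. The function names and the toy phoneme sequences are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def disagreeing_positions(hypothesis, other):
    """Return the indices of `hypothesis` that do not align to an identical phoneme in `other`.

    Assumption: a generic difflib alignment stands in for whatever alignment
    a real system would use. Phonemes missing from `hypothesis` occupy no
    hypothesis position and are therefore not counted here.
    """
    flagged = set()
    matcher = SequenceMatcher(a=hypothesis, b=other, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.update(range(i1, i2))
    return flagged


def detection_stats(g2p, acoustic, reference):
    """Compute (detection recall, flag rate) for one utterance.

    recall    = share of true G2P errors (vs. `reference`) that were flagged
    flag rate = share of G2P phonemes flagged as suspect
    """
    flagged = disagreeing_positions(g2p, acoustic)
    errors = disagreeing_positions(g2p, reference)
    recall = len(flagged & errors) / len(errors) if errors else 1.0
    flag_rate = len(flagged) / len(g2p) if g2p else 0.0
    return recall, flag_rate


# Toy sequences, invented for illustration (not taken from the paper's data):
# the G2P output keeps a schwa that the speaker did not pronounce.
g2p = ["p", "ə", "t", "i", "t", "a", "m", "i"]
acoustic = ["p", "t", "i", "t", "a", "m", "i"]
reference = ["p", "t", "i", "t", "a", "m", "i"]

recall, flag_rate = detection_stats(g2p, acoustic, reference)
print(f"recall={recall:.3f}, flag_rate={flag_rate:.3f}")
```

In this reading, a human annotator would only review the flagged positions, so a high recall combined with a low flag rate is what reduces the annotation effort.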


DOI: 10.21437/Interspeech.2018-1306

Cite as: Vythelingum, K., Estève, Y., Rosec, O. (2018) Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis. Proc. Interspeech 2018, 2489-2493, DOI: 10.21437/Interspeech.2018-1306.


@inproceedings{Vythelingum2018,
  author={Kévin Vythelingum and Yannick Estève and Olivier Rosec},
  title={Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2489--2493},
  doi={10.21437/Interspeech.2018-1306},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1306}
}