A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis

Kai-Zhan Lee, Erica Cooper, Julia Hirschberg


Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work.
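The abstract's objective measure, mel-cepstral distortion (MCD), is a standard frame-averaged distance between the mel-cepstral coefficients of reference and synthesized speech. A minimal sketch of the conventional computation is below; the function name is illustrative, the two input sequences are assumed to be time-aligned already (in practice dynamic time warping is typically applied first), and the 0th (energy) coefficient is excluded as is common.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref: np.ndarray, mcep_syn: np.ndarray) -> float:
    """Frame-averaged mel-cepstral distortion in dB.

    mcep_ref, mcep_syn: aligned arrays of shape (num_frames, num_coeffs),
    with the 0th coefficient being frame energy (excluded from the distance).
    """
    # Per-frame difference of coefficients 1..D (skip the energy term).
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    # Conventional scaling: (10 / ln 10) * sqrt(2 * sum of squared diffs).
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    per_frame = const * np.sqrt(np.sum(diff ** 2, axis=1))
    # Average over frames to get a single distortion score for the utterance.
    return float(np.mean(per_frame))
```

Lower values indicate synthesized speech whose spectral envelope is closer to the reference; identical inputs give a distortion of exactly zero.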


DOI: 10.21437/Interspeech.2018-1313

Cite as: Lee, K., Cooper, E., Hirschberg, J. (2018) A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis. Proc. Interspeech 2018, 2873-2877, DOI: 10.21437/Interspeech.2018-1313.


@inproceedings{Lee2018,
  author={Kai-Zhan Lee and Erica Cooper and Julia Hirschberg},
  title={A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={2873--2877},
  doi={10.21437/Interspeech.2018-1313},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1313}
}