Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

Erica Cooper, Xinyue Wang, Alison Chang, Yocheved Levitan, Julia Hirschberg


This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data originally collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances in a noisy, multi-speaker corpus for voice training: excluding speech that contains noise, and favoring speech close in nature to more traditionally collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment, in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices for Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus initially on English to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting utterances based on criteria such as the standard deviation of f0, a fast speaking rate, and hypo-articulation produces the most intelligible voices.
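The selection criteria named in the abstract (f0 standard deviation and speaking rate) can be sketched as a simple corpus filter. This is a minimal illustration, not the paper's implementation: the data layout, threshold values, and function names are all hypothetical, and the hypo-articulation criterion is omitted for brevity.

```python
import statistics

def f0_std(f0_values):
    """Standard deviation of f0 (Hz) over an utterance's voiced frames."""
    return statistics.pstdev(f0_values)

def speaking_rate(word_count, duration_s):
    """Words per second; the paper finds fast speech favorable."""
    return word_count / duration_s

def select_utterances(utterances, min_f0_std=20.0, min_rate=3.0):
    """Keep utterance IDs whose f0 variation and speaking rate both
    exceed illustrative thresholds (not the paper's actual values).

    `utterances` is a list of (id, f0_values, duration_s, word_count).
    """
    selected = []
    for utt_id, f0_values, duration_s, word_count in utterances:
        if (f0_std(f0_values) >= min_f0_std
                and speaking_rate(word_count, duration_s) >= min_rate):
            selected.append(utt_id)
    return selected

# Toy corpus: one expressive, fast utterance and one flat, slow one.
corpus = [
    ("utt1", [100, 140, 180, 120, 160], 1.5, 6),  # varied f0, 4 words/s
    ("utt2", [120, 121, 122, 121, 120], 2.0, 4),  # flat f0, 2 words/s
]
print(select_utterances(corpus))  # → ['utt1']
```

In practice the f0 values and word alignments would come from a pitch tracker and forced aligner run over the ASR corpus; only utterances passing the filter would be used for voice training.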


DOI: 10.21437/Interspeech.2017-465

Cite as: Cooper, E., Wang, X., Chang, A., Levitan, Y., Hirschberg, J. (2017) Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data. Proc. Interspeech 2017, 3971-3975, DOI: 10.21437/Interspeech.2017-465.


@inproceedings{Cooper2017,
  author={Erica Cooper and Xinyue Wang and Alison Chang and Yocheved Levitan and Julia Hirschberg},
  title={Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3971--3975},
  doi={10.21437/Interspeech.2017-465},
  url={http://dx.doi.org/10.21437/Interspeech.2017-465}
}