Data Selection and Adaptation for Naturalness in HMM-Based Speech Synthesis

Erica Cooper, Alison Chang, Yocheved Levitan, Julia Hirschberg


We describe experiments in building HMM text-to-speech voices on professional broadcast news data from multiple speakers. We build on earlier work comparing techniques for selecting utterances from the corpus and voice adaptation to produce the most natural-sounding voices. While our ultimate goal is to develop intelligible and natural-sounding synthetic voices in low-resource languages rapidly and without the expense of collecting and annotating data specifically for text-to-speech, we focus on English initially, in order to develop and evaluate our methods. We evaluate our approaches using crowdsourced listening tests for naturalness. We have found that two strategies produce the most natural-sounding voices: removing utterances that are outliers with respect to hyper-articulation, and combining the selection of hypo-articulated utterances with the selection of low mean f0 utterances.
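
The two selection strategies named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Utterance` fields, the z-score outlier rule, the `z_max` and `fraction` thresholds, and the reading of "combining" as an intersection of the two subsets are all assumptions made for illustration; the abstract does not specify how articulation is measured or how the subsets are combined.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Utterance:
    utt_id: str
    articulation: float  # hypothetical score: higher means more hyper-articulated
    mean_f0: float       # mean fundamental frequency in Hz


def remove_hyper_outliers(utts, z_max=2.0):
    """Drop utterances whose articulation z-score exceeds z_max (assumed rule)."""
    if not utts:
        return []
    scores = [u.articulation for u in utts]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return list(utts)  # all scores equal, so nothing is an outlier
    return [u for u in utts if (u.articulation - mu) / sigma <= z_max]


def select_hypo_and_low_f0(utts, fraction=0.5):
    """Keep utterances in the lowest `fraction` on both articulation and mean f0.

    Intersecting the two subsets is an assumption; a union is equally plausible.
    """
    if not utts:
        return []
    k = max(1, int(len(utts) * fraction))
    low_art = {u.utt_id for u in sorted(utts, key=lambda u: u.articulation)[:k]}
    low_f0 = {u.utt_id for u in sorted(utts, key=lambda u: u.mean_f0)[:k]}
    keep = low_art & low_f0
    return [u for u in utts if u.utt_id in keep]  # preserve corpus order
```

In a pipeline of this shape, the filtered list of utterance IDs would then be passed to voice training; the feature extraction that fills in `articulation` and `mean_f0` is left out because the abstract gives no detail on it.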


DOI: 10.21437/Interspeech.2016-502

Cite as

Cooper, E., Chang, A., Levitan, Y., Hirschberg, J. (2016) Data Selection and Adaptation for Naturalness in HMM-Based Speech Synthesis. Proc. Interspeech 2016, 357-361.

BibTeX
@inproceedings{Cooper+2016,
author={Erica Cooper and Alison Chang and Yocheved Levitan and Julia Hirschberg},
title={Data Selection and Adaptation for Naturalness in HMM-Based Speech Synthesis},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-502},
url={http://dx.doi.org/10.21437/Interspeech.2016-502},
pages={357--361}
}