Sixth ISCA Workshop on Speech Synthesis
This paper compares the effect of two voice corpus selection methods on the overall quality of the unit-selection text-to-speech (TTS) voices trained on the resulting corpora. The first method aims to maximize the coverage of stressed as well as unstressed diphones (phonologically balanced: Phonbal), while the second simply selects sentences at random (Random). We show that, as expected, the Phonbal method yields better phonetic and phonological coverage for both the training sentences and unseen test sentences. However, we also provide evidence from an objective evaluation and a subjective listening test that the Random method yields better overall voice quality when only automatic corpus annotation tools (such as forced alignment) are used, and potentially even with manual annotation. This result has general implications for the fast creation of TTS voices.
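The two selection strategies can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a simple greedy set-cover heuristic for the coverage-driven method, treats stress as part of the phone symbol (for example `AH0` vs. `OW1`), and uses a made-up four-sentence corpus. The function names (`select_greedy`, `select_random`, `coverage`) are hypothetical, and the sketch measures diphone coverage only, not voice quality.

```python
import random

def diphones(phones):
    """Return the set of adjacent phone pairs in one sentence."""
    return set(zip(phones, phones[1:]))

def select_greedy(corpus, n):
    """Pick up to n sentences, each time taking the one that adds the most unseen diphones."""
    remaining = dict(corpus)
    covered, chosen = set(), []
    while remaining and len(chosen) < n:
        best = max(remaining, key=lambda sid: len(diphones(remaining[sid]) - covered))
        covered |= diphones(remaining.pop(best))
        chosen.append(best)
    return chosen

def select_random(corpus, n, seed=0):
    """Pick up to n sentences uniformly at random (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(sorted(corpus), min(n, len(corpus)))

def coverage(corpus, chosen):
    """Fraction of the corpus's distinct diphones found in the chosen sentences."""
    all_units = set().union(*(diphones(p) for p in corpus.values()))
    got = set().union(*(diphones(corpus[s]) for s in chosen))
    return len(got) / len(all_units)

# Toy corpus (illustrative only): sentence id -> phone sequence with stress digits.
corpus = {
    "s1": ["HH", "AH0", "L", "OW1"],
    "s2": ["HH", "AH0", "L", "OW1"],   # duplicate of s1, adds no new diphones
    "s3": ["W", "ER1", "L", "D"],
    "s4": ["L", "OW1", "W", "ER1"],
}

greedy_ids = select_greedy(corpus, 2)
random_ids = select_random(corpus, 2)
print("greedy:", greedy_ids, coverage(corpus, greedy_ids))
print("random:", random_ids, coverage(corpus, random_ids))
```

On this toy corpus the greedy selection skips the duplicate sentence, which is the kind of coverage advantage the abstract attributes to Phonbal. The paper's point is that higher coverage of this sort did not translate into better voice quality under automatic annotation.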
Bibliographic reference. Lambert, Tanya / Braunschweiler, Norbert / Buchholz, Sabine (2007): "How (not) to select your voice corpus: random selection vs. phonologically balanced", In SSW6-2007, 264-269.