ISCA Archive SSW 2004

Improving TTS by higher agreement between predicted versus observed pronunciations

Yeon-Jun Kim, Ann Syrdal, Matthias Jilka

This paper looks at improving unit selection text-to-speech (TTS) quality by optimizing the agreement between frontend and speech database. We focused, in particular, on two classes of problems causing degradation in synthesis quality: 1) realization of /d/ and /t/ sounds and 2) confusions of unstressed vowels, especially with schwas. We investigated two approaches to tackling these problems. First, we improved the phonological processing in the front end modules. Further improvement resulted from creating speaker-dependent pronunciation lexicons for automatic speech labeling of our voice databases. This change helped in alleviating many pronunciation errors that resulted from mismatches between lexical pronunciations and how the speaker (voice talent) actually pronounced a word, while keeping consistency in labeling. Each speaker has his or her own unique pronunciations (and context-dependent variations), so that no one standard lexicon is able to cover all of the speakers' variations. A subjective listening test showed that combining these two approaches resulted in perceived quality improvement for American English male and female voices.


Cite as: Kim, Y.-J., Syrdal, A., Jilka, M. (2004) Improving TTS by higher agreement between predicted versus observed pronunciations. Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5), 127-132

@inproceedings{kim04_ssw,
  author={Yeon-Jun Kim and Ann Syrdal and Matthias Jilka},
  title={{Improving TTS by higher agreement between predicted versus observed pronunciations}},
  year=2004,
  booktitle={Proc. 5th ISCA Workshop on Speech Synthesis (SSW 5)},
  pages={127--132}
}