Fourth ISCA ITRW on Speech Synthesis
August 29 - September 1, 2001
This paper describes the current status of the IBM Trainable Speech Synthesis System. The system is a state-of-the-art, trainable, unit-selection-based concatenative speech synthesiser. The system uses hidden Markov models (HMMs) to provide a phonetic transcription and HMM state alignment of a database of single-speaker continuous-speech training data. The runtime synthesiser uses the resulting HMM-state-sized segments as its basic synthesis units. To produce a target sentence, it determines which segments to concatenate using decision trees built from the training data, together with a dynamic programming search that optimises a perceptually motivated cost function. The synthesiser can operate both in general-domain Text-to-Speech mode and in Phrase Splicing mode, which provides higher-quality synthesis in limited domains. Systems have been built in at least 10 different languages, with over 70 voices.
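The dynamic programming search mentioned in the abstract can be illustrated with a minimal sketch. It is not the system's implementation: the abstract does not specify the cost function or the decision trees, so this example assumes each candidate segment is reduced to a single scalar feature (for instance a pitch value) and that the caller supplies two hypothetical cost functions, `target_cost` and `join_cost`. The function name `select_units` is likewise an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[int, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[float, List[int]]:
    """Viterbi-style search over candidate segments.

    candidates[t] lists the candidate segments (here: scalar features,
    an assumption) for position t of the target sentence.
    Returns (total cost, index of the chosen candidate at each position).
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # best[j]: cheapest cost of any path ending in candidate j
    # at the position processed so far.
    best = [target_cost(0, c) for c in candidates[0]]
    back: List[List[int]] = []  # back[t-1][j]: best predecessor of j at t

    for t in range(1, len(candidates)):
        new_best: List[float] = []
        pointers: List[int] = []
        for cur in candidates[t]:
            # Cost of reaching `cur` from each candidate at position t-1.
            costs = [
                best[i] + join_cost(prev, cur)
                for i, prev in enumerate(candidates[t - 1])
            ]
            i_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[i_min] + target_cost(t, cur))
            pointers.append(i_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


if __name__ == "__main__":
    # Toy data (invented for illustration): desired feature per position
    # and two candidate segments per position.
    targets = [100.0, 150.0, 119.0]
    cands = [[100.0, 120.0], [150.0, 118.0], [119.0, 200.0]]
    total, path = select_units(
        cands,
        target_cost=lambda t, c: abs(c - targets[t]),
        join_cost=lambda a, b: abs(a - b),
    )
    print(total, path)
```

In this toy setting the search trades off how well each segment matches its target against how smoothly adjacent segments join. A real system would replace the scalar features and absolute-difference costs with the perceptually motivated cost function the abstract refers to.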
Donovan, R. / Ittycheriah, A. / Franz, M. / Ramabhadran, B. / Eide, E. /
Viswanathan, M. / Bakis, R. / Hamza, W. / Picheny, M. / Gleason, P. / Rutherfoord, T. /
Cox, P. / Green, D. / Janke, E. / Revelin, S. / Waast, C. / Zeller, B. /
Guenther, C. / Kunzmann, J. (2001):
"Current status of the IBM Trainable Speech Synthesis System",
In SSW4-2001, paper 207.