Sixth European Conference on Speech Communication and Technology
The Toshiba English Text-to-Speech Synthesizer (TESS) uses several new techniques to produce synthesized speech that is more natural-sounding and intelligible than that produced by conventional synthesizers. The closed-loop training method analytically solves an equation that minimizes the distortion between target units and training data, yielding synthesis units that most closely resemble the training data and are least susceptible to prosodic distortion noise. The pitch contour model builds a codebook of representative word-based F0 contours by first clustering the training data by word stress and number of syllables. Within each cluster, the training data is divided into groups according to the lexical and phonological attributes of each word. In each group, a representative contour is created using approximate error estimation. The resulting approximate errors are then used to predict the offset level for each contour. These techniques have significantly improved the prosodic quality, naturalness and intelligibility of the resulting synthesized speech.
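The first stage of the pitch contour codebook described above can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the approximate error estimation or the second-stage grouping by lexical and phonological attributes, so the sketch only clusters by (stress position, syllable count) and substitutes a plain point-wise mean as the representative contour. The function name `build_codebook`, the dictionary keys, and the assumption that every contour in a cluster has been resampled to the same length are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def build_codebook(words):
    """Build a toy word-level F0 codebook.

    `words` is an iterable of dicts with hypothetical keys:
      'stress'    - index of the stressed syllable
      'syllables' - number of syllables in the word
      'contour'   - list of F0 values (Hz), equal length within a cluster
    """
    # Stage 1: cluster the training words by stress position and syllable count.
    clusters = defaultdict(list)
    for w in words:
        clusters[(w["stress"], w["syllables"])].append(w["contour"])

    codebook = {}
    for key, contours in clusters.items():
        # Assumption: the point-wise mean stands in for the paper's
        # representative contour; it minimizes summed squared error.
        rep = [fmean(column) for column in zip(*contours)]
        # Mean squared error of the cluster members against the representative.
        mse = fmean(
            sum((a - b) ** 2 for a, b in zip(c, rep)) / len(rep)
            for c in contours
        )
        codebook[key] = {"contour": rep, "mse": mse}
    return codebook
```

Calling `build_codebook` on a list of such word records returns one entry per (stress, syllable count) cluster. The stored error is only a stand-in for the approximate errors that the paper feeds into offset level prediction.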
Bibliographic reference. Suh, Chang K. / Kagoshima, Takehiko / Morita, Masahiro / Seto, Shigenobu / Akamine, Masami (1999): "Toshiba English text-to-speech synthesizer (TESS)", In EUROSPEECH'99, 2111-2114.