5th International Conference on Spoken Language Processing
The speech synthesis system described in this paper uses a set of speaker-dependent decision-tree state-clustered hidden Markov models to automatically generate a leaf-level segmentation of a large single-speaker continuous-read-speech database. During synthesis, the phone sequence to be synthesised is converted to an acoustic leaf sequence by descending the HMM decision trees. Duration, energy and pitch values are predicted using separate trainable models. To determine the segment sequence to concatenate, a dynamic programming (d.p.) search is performed over all the waveform segments aligned to each leaf in training. The d.p. attempts to ensure that the selected segments join each other spectrally, and have durations, energies and pitches such that the amount of degradation introduced by the subsequent use of TD-PSOLA is minimised. The selected segments are then concatenated and modified to have the required prosodic values using TD-PSOLA. The d.p. results in the system effectively selecting variable-length units.
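The d.p. search described above can be illustrated with a minimal sketch. The abstract does not give the cost functions, so everything here is an assumption made for illustration: the names `Segment`, `Target`, `target_cost`, `join_cost` and `select_segments` are hypothetical, the target cost is a plain sum of absolute prosodic mismatches, and the join cost is a squared distance between toy boundary spectra. It shows only the general shape of a Viterbi-style search over the candidate segments for each leaf, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Segment:
    """A candidate waveform segment aligned to one leaf (hypothetical layout)."""
    seg_id: int
    duration: float
    energy: float
    pitch: float
    start_spec: Tuple[float, ...]  # toy spectral vector at the left boundary
    end_spec: Tuple[float, ...]    # toy spectral vector at the right boundary


@dataclass
class Target:
    """Predicted prosodic values for one leaf in the sequence to synthesise."""
    duration: float
    energy: float
    pitch: float


def target_cost(seg: Segment, tgt: Target) -> float:
    # Assumed cost: total prosodic mismatch, standing in for the amount of
    # modification that TD-PSOLA would have to apply.
    return (abs(seg.duration - tgt.duration)
            + abs(seg.energy - tgt.energy)
            + abs(seg.pitch - tgt.pitch))


def join_cost(prev: Segment, nxt: Segment) -> float:
    # Assumed cost: squared distance between the spectra at the join.
    return sum((a - b) ** 2 for a, b in zip(prev.end_spec, nxt.start_spec))


def select_segments(candidates: Sequence[Sequence[Segment]],
                    targets: Sequence[Target]) -> List[Segment]:
    """Return the lowest-cost segment sequence, one segment per leaf."""
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one non-empty candidate list per target")
    if any(len(c) == 0 for c in candidates):
        raise ValueError("every leaf needs at least one candidate segment")

    # best[i]: cheapest cost of any path ending in candidate i of the current leaf.
    best = [target_cost(s, targets[0]) for s in candidates[0]]
    back: List[List[int]] = []  # back-pointers for leaves 1..n-1

    for t in range(1, len(candidates)):
        new_best, pointers = [], []
        for seg in candidates[t]:
            costs = [best[j] + join_cost(prev, seg)
                     for j, prev in enumerate(candidates[t - 1])]
            j_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_min] + target_cost(seg, targets[t]))
            pointers.append(j_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    idx = min(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]
```

In this sketch the search costs on the order of the number of leaves times the square of the number of candidates per leaf. A practical system would presumably need weighting between the cost terms and some pruning of candidates, neither of which is shown here.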
Sound Example. This example is the synthetic sentence "When a sailor in a small craft faces the might of the vast Atlantic Ocean today, he takes the same risks as generations took before him."
Bibliographic reference. Donovan, Robert E. / Eide, Ellen M. (1998): "The IBM trainable speech synthesis system", In ICSLP-1998, paper 0166.