Natural-sounding speech synthesis using variable-length units

Jon R. W. Yi, James R. Glass

The goal of this work was to develop a speech synthesis system that concatenates variable-length units to create natural-sounding speech. Our initial work showed that, by carefully designing system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with word- and phrase-level concatenation. To extend the flexibility of this framework, we focused on generating novel words from a corpus of sub-word units. The design of the corpus was motivated by perceptual experiments that investigated where speech could be spliced with minimal audible distortion and which contextual constraints had to be maintained to produce natural-sounding speech. From this sub-word corpus, a Viterbi search selects a sequence of units based on how well they match the input specification and satisfy the concatenation constraints. This concatenative speech synthesis system, ENVOICE, has been used in a conversational system in two application domains to convert meaning representations into speech waveforms.
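The Viterbi unit selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the cost functions, so everything below is an assumption made for illustration: the `Unit` record, the `target_cost` and `concat_cost` functions, their numeric values, and the toy two-word corpus are hypothetical and are not taken from ENVOICE. The sketch only shows the general technique: each candidate unit is scored on how well it matches the requested phone, each join is scored on whether the two units were adjacent in the recorded source, and dynamic programming finds the cheapest sequence.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    """A hypothetical sub-word unit: a phone plus where it was recorded."""
    phone: str
    utt: int  # index of the source utterance
    pos: int  # position of the phone within that utterance


def target_cost(phone, unit):
    # Assumed cost: free if the unit is the requested phone, expensive otherwise.
    return 0.0 if unit.phone == phone else 10.0


def concat_cost(left, right):
    # Assumed cost: a join is free when the two units were adjacent in the
    # same recording (no splice is needed), and costs 1.0 otherwise.
    contiguous = left.utt == right.utt and right.pos == left.pos + 1
    return 0.0 if contiguous else 1.0


def viterbi_select(targets, candidates):
    """Return (total_cost, units) for the cheapest unit sequence.

    targets[i] is the requested phone; candidates[i] lists the units
    that may realize it.
    """
    if not targets:
        return 0.0, []
    if any(not c for c in candidates):
        raise ValueError("every target needs at least one candidate unit")

    # costs[j]: cheapest path cost ending in candidates[i][j].
    costs = [target_cost(targets[0], u) for u in candidates[0]]
    back = []  # back[i - 1][j]: best predecessor index for candidates[i][j]
    for i in range(1, len(targets)):
        new_costs, pointers = [], []
        for unit in candidates[i]:
            scored = [(costs[j] + concat_cost(prev, unit), j)
                      for j, prev in enumerate(candidates[i - 1])]
            best_cost, best_j = min(scored)
            new_costs.append(best_cost + target_cost(targets[i], unit))
            pointers.append(best_j)
        costs = new_costs
        back.append(pointers)

    # Trace back from the cheapest final unit.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [candidates[-1][j]]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i - 1][j]
        path.append(candidates[i - 1][j])
    path.reverse()
    return total, path


# Toy corpus (hypothetical): "cab" and "bat", used to build the novel word "cat".
corpus = [["k", "ae", "b"], ["b", "ae", "t"]]
units = [Unit(p, u, i) for u, utt in enumerate(corpus) for i, p in enumerate(utt)]
targets = ["k", "ae", "t"]
candidates = [[u for u in units if u.phone == p] for p in targets]
total, path = viterbi_select(targets, candidates)
print(total, path)
```

In this toy setup no single recording contains the whole word, so any valid path must include at least one splice. A real system would replace the binary costs with graded ones reflecting the perceptual findings on where splices are least audible.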

doi: 10.21437/ICSLP.1998-575

