ISCA Archive Interspeech 2008

Statistical text-to-speech synthesis with improved dynamics

Stas Tiomkin, David Malah

In statistical TTS (STTS) systems, speech feature dynamics are modeled by first- and second-order frame differences of the features, which typically do not satisfactorily represent the frame-to-frame feature dynamics present in natural speech. The reduced dynamics result in over-smoothing of the speech features, which often makes the synthesized speech sound muffled. To improve feature dynamics, a Global Variance approach has been suggested; however, it is computationally complex. We propose a different approach for modeling feature dynamics, based on applying the DFT to the whole set of feature frames representing a phoneme. In the transform domain, the inter-frame feature dynamics are then expressed in terms of inter-harmonic content, which can be modified to statistically match the dynamics of natural speech. To synthesize a whole utterance, we propose a method for smoothly combining the enhanced-dynamics phonemes, which improves the synthesized speech quality of STTS at a complexity similar to that of conventional STTS.
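The transform-domain idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `enhance_phoneme_dynamics`, the use of NumPy, and the single scalar `harmonic_gain` are assumptions made here for illustration, and the abstract does not specify how the inter-harmonic content is actually matched to natural-speech statistics. The sketch only shows that taking the DFT along the frame axis of one phoneme separates the per-feature mean (the DC term) from the frame-to-frame variation (the non-DC harmonics), so the latter can be modified on its own.

```python
import numpy as np

def enhance_phoneme_dynamics(frames, harmonic_gain):
    """Scale the non-DC DFT harmonics of one phoneme's feature trajectory.

    frames: array of shape (T, D), T feature frames of dimension D
            belonging to a single phoneme.
    harmonic_gain: hypothetical scalar gain (an assumption of this sketch);
            a real system would derive the modification from statistics
            of natural speech.
    """
    frames = np.asarray(frames, dtype=float)
    num_frames = frames.shape[0]

    # DFT along the time (frame) axis, independently per feature dimension.
    spectrum = np.fft.rfft(frames, axis=0)

    # Bin 0 holds the per-feature mean; bins 1.. carry the inter-frame dynamics.
    spectrum[1:] *= harmonic_gain

    # Back to the frame domain, keeping the original number of frames.
    return np.fft.irfft(spectrum, n=num_frames, axis=0)
```

With a gain above 1, each feature keeps its mean over the phoneme while its deviation from that mean is amplified, which is one simple way to counteract over-smoothing. Smoothly joining the modified phonemes into an utterance, as the abstract describes, is not covered by this sketch.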

doi: 10.21437/Interspeech.2008-179

Cite as: Tiomkin, S., Malah, D. (2008) Statistical text-to-speech synthesis with improved dynamics. Proc. Interspeech 2008, 1841-1844, doi: 10.21437/Interspeech.2008-179

  author={Stas Tiomkin and David Malah},
  title={{Statistical text-to-speech synthesis with improved dynamics}},
  booktitle={Proc. Interspeech 2008},