In statistical TTS (STTS) systems, speech feature dynamics are modeled by first- and second-order differences between feature frames, which typically do not represent satisfactorily the frame-to-frame dynamics present in natural speech. The reduced dynamics results in over-smoothing of the speech features, often heard as muffled synthesized speech. The Global Variance approach has been suggested to improve feature dynamics, but it is computationally complex. We propose a different approach to modeling feature dynamics, based on applying the DFT to the whole set of feature frames representing a phoneme. In the transform domain, the inter-frame feature dynamics is expressed in terms of inter-harmonic content, which can be modified to statistically match the dynamics of natural speech. To synthesize a whole utterance, we propose a method for smoothly combining the enhanced-dynamics phonemes, which improves the quality of the synthesized speech at a complexity similar to that of conventional STTS.
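The core transform-domain idea can be illustrated with a minimal NumPy sketch. It applies a DFT along the time axis of a phoneme's feature-frame sequence and amplifies the non-DC harmonics, which carry the inter-frame dynamics. The function name `enhance_dynamics` and the single uniform `gain` factor are illustrative assumptions; the paper instead modifies the inter-harmonic content to statistically match natural-speech dynamics rather than applying one fixed gain.

```python
import numpy as np

def enhance_dynamics(frames, gain):
    """Hypothetical sketch of transform-domain dynamics enhancement.

    frames : (T, D) array, T feature frames of dimension D for one phoneme
    gain   : factor (> 1) applied to the non-DC harmonics; a stand-in for
             the statistically matched modification described in the paper
    """
    # DFT along the time (frame) axis: each column becomes T//2 + 1 harmonics.
    spec = np.fft.rfft(frames, axis=0)
    # Amplify all non-DC harmonics; the DC bin (frame mean) is left intact,
    # so only the inter-frame variation is boosted.
    spec[1:] *= gain
    # Back to the frame domain with the original number of frames.
    return np.fft.irfft(spec, n=frames.shape[0], axis=0)
```

Because only the non-DC harmonics are scaled, the per-dimension mean over frames is preserved while the per-dimension variance grows by `gain**2`, directly counteracting the over-smoothing effect.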
Bibliographic reference. Tiomkin, Stas / Malah, David (2008): "Statistical text-to-speech synthesis with improved dynamics", In INTERSPEECH-2008, 1841-1844.