This paper introduces a novel excitation approach for speech synthesizers in which the final waveform is generated through parameters directly obtained from Hidden Markov Models (HMMs). Despite the attractiveness of the HMM-based speech synthesis technique, namely utilization of small corpora and flexibility concerning the achievement of different voice styles, synthesized speech presents a characteristic buzziness caused by the simple excitation model which is employed during the speech production. This paper presents an innovative scheme where mixed excitation is modeled through closed-loop training of a set of state-dependent filters and pulse trains, with minimization of the error between excitation and residual sequences. The proposed method shows effectiveness, yielding synthesized speech with quality far superior to the simple excitation baseline and comparable to the best excitation schemes thus far reported for HMM-based speech synthesis.
Bibliographic reference. Maia, R. / Toda, Tomoki / Zen, Heiga / Nankaku, Yoshihiko / Tokuda, Keiichi (2007): "A trainable excitation model for HMM-based speech synthesis", In INTERSPEECH-2007, 1909-1912.