This paper describes a novel framework for statistical parametric speech synthesis in which the speech waveform is modeled statistically through joint estimation of acoustic and excitation model parameters. The proposed method combines the extraction of spectral parameters, treated as hidden variables, with excitation signal modeling, in a fashion similar to the factor analyzed trajectory hidden Markov model. The resulting joint model can be interpreted as waveform-level closed-loop training, in which the distance between natural and synthesized speech is minimized. An algorithm based on the maximum likelihood criterion is introduced to train the proposed joint model, and experiments are presented to show its effectiveness.
Index terms: statistical parametric speech synthesis, trajectory hidden Markov model, excitation modeling, factor analysis.
Cite as: Maia, R., Zen, H., Gales, M.J.F. (2010) Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 88-93
@inproceedings{maia10_ssw,
  author={Ranniery Maia and Heiga Zen and M. J. F. Gales},
  title={{Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters}},
  year=2010,
  booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)},
  pages={88--93}
}