We propose a method for concatenative speech synthesis that improves the match between the logF0 and duration predicted by the prosody module and the waveform generation back-end. The proposed method is based on our previous multilevel parametric F0 model and Toshiba's plural unit selection and fusion synthesizer. The method adds a feedback loop from the back-end into the prosody module so that the prosodic information of the selected units is used to re-estimate the prosody values. The feedback loop defines a frame-level prosody model that consists of the mean and variance of the duration and logF0 of the selected units. The log-likelihood defined by this model is added to the log-likelihood of the prosody model. Maximizing this total log-likelihood yields the prosody values that give the optimum compromise between the distortion introduced by F0 discontinuities and the distortion created by the signal processing that adjusts the prosody.
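The combination of the two log-likelihoods can be illustrated with a minimal sketch. It assumes, for illustration only, that both the prosody model and the frame-level feedback model are independent per-frame Gaussians; the multilevel parametric model used in the paper is more structured than this. The function names `unit_stats` and `combine` and the `weight` parameter are hypothetical and do not come from the paper. Under these assumptions, the value that maximizes the summed log-likelihood is the precision-weighted average of the two means.

```python
import numpy as np

def unit_stats(values):
    """Mean and variance of one prosodic feature (e.g. logF0 or duration)
    over the units selected for a frame. A small floor keeps the variance
    strictly positive when all units agree."""
    v = np.asarray(values, dtype=float)
    return v.mean(), v.var() + 1e-6

def combine(mu_p, var_p, mu_u, var_u, weight=1.0):
    """Return the x that maximizes
        log N(x; mu_p, var_p) + weight * log N(x; mu_u, var_u).
    mu_p, var_p: prediction of the prosody model (assumed Gaussian).
    mu_u, var_u: statistics of the selected units (feedback model).
    weight: hypothetical scaling of the feedback term; 1.0 is a plain sum."""
    prec_p = 1.0 / np.asarray(var_p, dtype=float)
    prec_u = weight / np.asarray(var_u, dtype=float)
    # Setting the derivative of the summed log-likelihood to zero gives
    # a precision-weighted average of the two means.
    return (prec_p * mu_p + prec_u * mu_u) / (prec_p + prec_u)

# Example: the selected units sit slightly above the predicted logF0.
mu_u, var_u = unit_stats([5.0, 5.2, 5.4])
new_logf0 = combine(mu_p=5.0, var_p=0.05, mu_u=mu_u, var_u=var_u)
```

When the selected units agree closely (small variance), the re-estimated value moves toward them, which reduces the amount of signal processing needed; when they disagree, the prosody model's prediction dominates.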
Cite as: Latorre, J., Gracia, S., Akamine, M. (2009) Feedback loop for prosody prediction in concatenative speech synthesis. Proc. Interspeech 2009, 2067-2070, doi: 10.21437/Interspeech.2009-593
@inproceedings{latorre09_interspeech,
  author={Javier Latorre and Sergio Gracia and Masami Akamine},
  title={{Feedback loop for prosody prediction in concatenative speech synthesis}},
  year=2009,
  booktitle={Proc. Interspeech 2009},
  pages={2067--2070},
  doi={10.21437/Interspeech.2009-593}
}