Incremental speech synthesis aims to deliver the synthetic voice while the sentence is still being typed. One of the main challenges is the online estimation of the target prosody from partial knowledge of the sentence's syntactic structure. In the context of HMM-based speech synthesis, this typically results in missing segmental and suprasegmental features, which describe the linguistic context of each phoneme. This study describes a voice training procedure that explicitly integrates potential uncertainty about some contextual features. The proposed technique is compared to a previously published baseline approach, which replaces a missing contextual feature with a default value calculated on the training set. Both techniques were implemented in an HMM-based text-to-speech system for French and compared using objective and perceptual measurements. Experimental results show that the proposed strategy outperforms the baseline technique for this language.
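The baseline substitution strategy can be illustrated with a minimal sketch. The abstract does not say how the default value is calculated, so the sketch assumes the most frequent value of each feature in the training set. The feature names, the `MISSING` marker, and the helper functions `compute_defaults` and `fill_missing` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical marker for a contextual feature that is not yet known
# because the rest of the sentence has not been typed.
MISSING = None


def compute_defaults(training_contexts):
    """Return, for each feature, its most frequent observed value.

    Assumption: the "default value calculated on the training set" is
    taken here to be the mode; the abstract does not specify this.
    """
    counts = {}
    for context in training_contexts:
        for name, value in context.items():
            if value is not MISSING:
                counts.setdefault(name, Counter())[value] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


def fill_missing(context, defaults):
    """Replace each missing feature with its training-set default."""
    return {
        name: defaults[name] if value is MISSING else value
        for name, value in context.items()
    }


# Toy training contexts with two illustrative features per phoneme.
training = [
    {"pos_next_word": "NOUN", "syllables_in_phrase": 4},
    {"pos_next_word": "NOUN", "syllables_in_phrase": 6},
    {"pos_next_word": "VERB", "syllables_in_phrase": 4},
]
defaults = compute_defaults(training)

# At synthesis time the next word is still unknown.
partial = {"pos_next_word": MISSING, "syllables_in_phrase": 6}
completed = fill_missing(partial, defaults)
```

In this toy case the unknown `pos_next_word` is filled with `"NOUN"`, while the observed `syllables_in_phrase` value is left unchanged. The proposed training procedure differs in that it accounts for the uncertainty during voice training rather than patching the features at synthesis time; it is not sketched here because the abstract gives no further detail.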
Cite as: Pouget, M., Hueber, T., Bailly, G., Baumann, T. (2015) HMM training strategy for incremental speech synthesis. Proc. Interspeech 2015, 1201-1205, doi: 10.21437/Interspeech.2015-304
@inproceedings{pouget15_interspeech,
  author={Maël Pouget and Thomas Hueber and Gérard Bailly and Timo Baumann},
  title={{HMM training strategy for incremental speech synthesis}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={1201--1205},
  doi={10.21437/Interspeech.2015-304}
}