We implement an architecture with explicit prominence learning via a prominence network in Merlin, a statistical-parametric DNN-based text-to-speech system. We build on our previous results, which successfully evaluated the inclusion of an automatically extracted, speech-based prominence feature in training and its control at synthesis time. In this work, we expand the PROMIS system by implementing the prominence network, which predicts prominence values from text. We test the network predictions as well as the effects of a prominence control module based on SSML-like tags. Listening tests for the complete PROMIS system, combining a prominence feature, a prominence network and prominence control, show that it effectively controls prominence in a diagnostic set of target words. With one of the tested tagging methods, it also has no negative impact on perceived naturalness relative to the baseline.
Cite as: Malisz, Z., Berthelsen, H., Beskow, J., Gustafson, J. (2019) PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 257-262, doi: 10.21437/SSW.2019-46
@inproceedings{malisz19_ssw,
  author={Zofia Malisz and Harald Berthelsen and Jonas Beskow and Joakim Gustafson},
  title={{PROMIS: a statistical-parametric speech synthesis system with prominence control via a prominence network}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={257--262},
  doi={10.21437/SSW.2019-46}
}