Controlling Prominence Realisation in Parametric DNN-Based Speech Synthesis

Zofia Malisz, Harald Berthelsen, Jonas Beskow, Joakim Gustafson


This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modelling and test the first component of that architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and use the feature to train an experimental TTS system with Merlin, a statistical-parametric DNN-based engine. We synthesise test sentences with contrastive prominence at the word level and conduct separate listening tests evaluating a) the level of prominence control in the generated speech and b) its naturalness. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.
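
As a rough illustration of the label-modification step mentioned in the abstract, the sketch below appends a per-segment prominence level to HTS-style full-context label lines of the form "start end context". It is not the authors' implementation: the `/PROM:` field name, the function names, and the assumption of one integer prominence level per label line are all hypothetical. A Merlin-style front end would presumably also need a matching entry in its question file before such a field could become a network input.

```python
from typing import List


def add_prominence_tag(label_line: str, level: int) -> str:
    """Append a hypothetical '/PROM:<level>' field to one label line.

    Assumes an HTS-style line: '<start> <end> <context-string>'.
    """
    start, end, context = label_line.strip().split(maxsplit=2)
    return f"{start} {end} {context}/PROM:{level}"


def tag_labels(lines: List[str], levels: List[int]) -> List[str]:
    """Tag every label line with its prominence level (one level per line)."""
    if len(lines) != len(levels):
        raise ValueError("need exactly one prominence level per label line")
    return [add_prominence_tag(line, lvl) for line, lvl in zip(lines, levels)]


if __name__ == "__main__":
    # Toy label lines; the context strings are shortened placeholders.
    labels = ["0 500000 x^sil-dh+ax", "500000 900000 sil^dh-ax+k"]
    for tagged in tag_labels(labels, [0, 2]):
        print(tagged)
```

Keeping the prominence value as a separate trailing field leaves the existing context string untouched, so a baseline system that ignores the new field can still be trained from the same files.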


DOI: 10.21437/Interspeech.2017-1355

Cite as: Malisz, Z., Berthelsen, H., Beskow, J., Gustafson, J. (2017) Controlling Prominence Realisation in Parametric DNN-Based Speech Synthesis. Proc. Interspeech 2017, 1079-1083, DOI: 10.21437/Interspeech.2017-1355.


@inproceedings{Malisz2017,
  author={Zofia Malisz and Harald Berthelsen and Jonas Beskow and Joakim Gustafson},
  title={Controlling Prominence Realisation in Parametric DNN-Based Speech Synthesis},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1079--1083},
  doi={10.21437/Interspeech.2017-1355},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1355}
}