Word Emphasis Prediction for Expressive Text to Speech

Yosi Mass, Slava Shechtman, Moran Mordechay, Ron Hoory, Oren Sar Shalom, Guy Lev, David Konopnicki


Word emphasis prediction is an important part of expressive prosody generation in modern Text-To-Speech (TTS) systems. We present a method for predicting emphasized words for expressive TTS, based on a Deep Neural Network (DNN). We show that the presented method outperforms machine learning methods based on hand-crafted features in terms of objective metrics such as precision and recall. Using a listening test, we further demonstrate that the contribution of the predicted emphasized words to the expressiveness of the synthesized speech is subjectively perceivable.
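To make the objective metrics concrete, the following is a minimal sketch of how precision and recall could be computed for word-level emphasis labels. It assumes each word carries a binary label (1 = emphasized, 0 = not emphasized); the function name `precision_recall`, the example sentence, and the labels are hypothetical and are not taken from the paper or its data.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary per-word emphasis labels (1 = emphasized)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # Count agreement and disagreement on the positive (emphasized) class.
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)

    # Convention chosen here: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy sentence with made-up labels, for illustration only.
words = ["I", "really", "love", "this", "song"]
gold = [0, 1, 0, 0, 1]        # assumed human annotation
predicted = [0, 1, 1, 0, 0]   # assumed model output

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision here is the fraction of predicted emphasized words that annotators also marked as emphasized, and recall is the fraction of annotated emphasized words that the model recovered.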


DOI: 10.21437/Interspeech.2018-1159

Cite as: Mass, Y., Shechtman, S., Mordechay, M., Hoory, R., Sar Shalom, O., Lev, G., Konopnicki, D. (2018) Word Emphasis Prediction for Expressive Text to Speech. Proc. Interspeech 2018, 2868-2872, DOI: 10.21437/Interspeech.2018-1159.


@inproceedings{Mass2018,
  author={Yosi Mass and Slava Shechtman and Moran Mordechay and Ron Hoory and Oren {Sar Shalom} and Guy Lev and David Konopnicki},
  title={Word Emphasis Prediction for Expressive Text to Speech},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2868--2872},
  doi={10.21437/Interspeech.2018-1159},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1159}
}