Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation

Rasmus Dall, Junichi Yamagishi, Simon King


In this paper we present evidence that speech produced spontaneously in a conversation is considered more natural than read prompts. We also explore the relationship between participants' expectations of the speech style under evaluation and their actual ratings. In successive listening tests, subjects are presented with either spontaneously produced, read-aloud or written sentences, and are asked to rate the naturalness of each sentence under instructions oriented toward conversational, reading or general naturalness. It was found that, when presented with spontaneous or read-aloud speech, participants consistently rated spontaneous speech as more natural, even when asked to rate naturalness in the reading case. Presented with only text, participants generally preferred transcriptions of spontaneous utterances, except when asked to evaluate naturalness in terms of reading aloud. This has implications for the application of MOS-scale naturalness ratings in Speech Synthesis, and potentially for the type of data suitable for use in general TTS, in dialogue systems and specifically in Conversational TTS, in which the goal is to reproduce speech as it is produced in a spontaneous conversational setting.


DOI: 10.21437/SpeechProsody.2014-191

Cite as: Dall, R., Yamagishi, J., King, S. (2014) Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation. Proc. 7th International Conference on Speech Prosody 2014, 1012-1016, DOI: 10.21437/SpeechProsody.2014-191.


@inproceedings{Dall2014,
  author={Rasmus Dall and Junichi Yamagishi and Simon King},
  title={{Rating Naturalness in Speech Synthesis: The Effect of Style and Expectation}},
  year=2014,
  booktitle={Proc. 7th International Conference on Speech Prosody 2014},
  pages={1012--1016},
  doi={10.21437/SpeechProsody.2014-191},
  url={http://dx.doi.org/10.21437/SpeechProsody.2014-191}
}