As synthetic voices become more flexible and conversational systems gain more potential to adapt to the environmental and social situation, it becomes necessary to examine how different modifications to synthetic speech interact with each other and how their specific combinations influence perception. This work investigates how the vocal effort of synthetic speech, together with added disfluencies, affects listeners’ perception of the degree of uncertainty in an utterance. We introduce a DNN voice built entirely from spontaneous conversational speech data and capable of producing a continuum of vocal efforts, prolongations and filled pauses with a corpus-based method. Results of a listener evaluation indicate that decreased vocal effort, filled pauses and prolongation of function words increase the degree of perceived uncertainty of conversational utterances expressing the speaker’s beliefs. We demonstrate that the effects of these three cues are not merely additive, but that interaction effects, in particular between the two types of disfluencies and between vocal effort and prolongations, need to be considered when aiming to communicate a specific level of uncertainty. These findings are relevant for adaptive and incremental conversational systems that use expressive speech synthesis and aim to communicate the attitude of uncertainty.
Cite as: Székely, É., Mendelson, J., Gustafson, J. (2017) Synthesising Uncertainty: The Interplay of Vocal Effort and Hesitation Disfluencies. Proc. Interspeech 2017, 804-808, doi: 10.21437/Interspeech.2017-1507
@inproceedings{szekely17_interspeech,
  author={Éva Székely and Joseph Mendelson and Joakim Gustafson},
  title={{Synthesising Uncertainty: The Interplay of Vocal Effort and Hesitation Disfluencies}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={804--808},
  doi={10.21437/Interspeech.2017-1507}
}