Evaluating comprehension of natural and synthetic conversational speech

Mirjam Wester, Oliver Watts, Gustav Eje Henter


Current speech synthesis methods typically operate on isolated sentences and lack convincing prosody when generating longer segments of speech. Similarly, prevailing TTS evaluation paradigms, such as intelligibility (transcription word error rate) or MOS, only score sentences in isolation, even though overall comprehension is arguably more important for speech-based communication. In an effort to develop more ecologically relevant evaluation techniques that go beyond isolated sentences, we investigated comprehension of natural and synthetic speech dialogues. Specifically, we tested listener comprehension on long segments of spontaneous and engaging conversational speech (three 10-minute radio interviews of comedians). Interviews were reproduced either as natural speech, synthesised from carefully prepared transcripts, or synthesised using durations from forced alignment against the natural speech, all in a balanced design. Comprehension was measured using multiple-choice questions. A significant difference was measured between the comprehension/retention of natural speech (74% correct responses) and synthetic speech with forced-aligned durations (61% correct responses). However, no significant difference was observed between natural and regular synthetic speech (70% correct responses). Effective evaluation of comprehension remains elusive.


DOI: 10.21437/SpeechProsody.2016-157

Cite as

Wester, M., Watts, O., Henter, G.E. (2016) Evaluating comprehension of natural and synthetic conversational speech. Proc. Speech Prosody 2016, 766-770.

Bibtex
@inproceedings{Wester+2016,
author={Mirjam Wester and Oliver Watts and Gustav Eje Henter},
title={Evaluating comprehension of natural and synthetic conversational speech},
year=2016,
booktitle={Speech Prosody 2016},
doi={10.21437/SpeechProsody.2016-157},
url={http://dx.doi.org/10.21437/SpeechProsody.2016-157},
pages={766--770}
}