Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

User Attitudes to Concatenated Natural Speech and Text-to-Speech Synthesis in an Automated Information Service

F. R. McInnes (1), D. J. Attwater (2), Michael D. Edgington (3), Mark S. Schmidt (4), Mervyn A. Jack (1)

(1) CCIR, The University of Edinburgh, Edinburgh, UK
(2) BT Laboratories, Martlesham Heath, Ipswich, UK
(3) SRI International, Cambridge, UK
(4) Andersen Consulting, UK
Mike Edgington was at BT Laboratories and Mark Schmidt was at CCIR at the time of the experiment reported here.

Today’s automated telephone services generally use recorded speech from one speaker for all output. In applications with large and varying output vocabularies, such as place names, it may be necessary to employ a second speaker to provide new vocabulary items if the original speaker is not available, or to use text-to-speech (TTS) synthesis for the whole or parts of the output. This paper reports a comparison of 10 schemes for the generation of spoken output in a travel information service, ranging from natural speech from a single speaker, through combinations of different voices and of natural and synthetic speech, to TTS synthesis throughout. The results show strong preferences for concatenated speech over TTS and for quality recordings over amateur ones, and a weaker preference for a single speaker over two speakers.

Acoustic Example #1 (V0)
Acoustic Example #2 (V1)
Acoustic Example #3 (V2)
Acoustic Example #4 (V3)
Acoustic Example #5 (V5)

Bibliographic reference.  McInnes, F. R. / Attwater, D. J. / Edgington, Michael D. / Schmidt, Mark S. / Jack, Mervyn A. (1999): "User attitudes to concatenated natural speech and text-to-speech synthesis in an automated information service", In EUROSPEECH'99, 831-834.