Sixth European Conference on Speech Communication and Technology
Today’s automated telephone services generally use recorded speech from one speaker for all output. In applications with large and varying output vocabularies, such as place names, it may be necessary to employ a second speaker to provide new vocabulary items if the original speaker is not available, or to use text-to-speech (TTS) synthesis for the whole or parts of the output. This paper reports a comparison of 10 schemes for the generation of spoken output in a travel information service, ranging from natural speech from a single speaker, through combinations of different voices and of natural and synthetic speech, to TTS synthesis throughout. The results show strong preferences for concatenated speech over TTS and for quality recordings over amateur ones, and a weaker preference for a single speaker over two speakers.
Full Paper (PDF)
Acoustic Example #1 (V0)
Acoustic Example #2 (V1)
Acoustic Example #3 (V2)
Acoustic Example #4 (V3)
Acoustic Example #5 (V5)
Bibliographic reference. McInnes, F. R. / Attwater, D. J. / Edgington, Michael D. / Schmidt, Mark S. / Jack, Mervyn A. (1999): "User attitudes to concatenated natural speech and text-to-speech synthesis in an automated information service", in EUROSPEECH'99, 831-834.