September 22-25, 1997
Concatenative text-to-speech (TTS) systems are now quite widespread through the availability of simple time-domain speech modification algorithms. Many of these systems produce intelligible speech with a higher degree of naturalness than that achieved by the previous generation of formant synthesis systems. This perceived improvement in quality has led to the view in some circles that TTS is a solved problem, at least for many practical applications. Three experiments are reported in this paper, all performed with a concatenative TTS system. These experiments investigated aspects of the concatenative model by respectively addressing copy synthesis of emotional speech, modelling glottalisation, and the effect of speech database design on the quality of synthesised speech. This paper suggests that the lack of an explicit speech model in most concatenative synthesis strategies fundamentally limits the usefulness of many current systems to the relatively restricted task of 'neutral' spoken renderings of text, where deficiencies in other system components usually mask the limitations of the synthesis strategy itself.
Bibliographic reference. Edgington, Mike (1997): "Investigating the limitations of concatenative synthesis", In EUROSPEECH-1997, 593-596.