Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Synthesising Hyperarticulation in Unit Selection TTS

Matthew P. Aylett

University of Edinburgh, UK

In speech synthesis we often wish to give extra focus to words that carry important information, such as names, dates and amounts. In this paper we examine cost functions that can be used to bias unit selection in favour of hyperarticulated speech in order to give this impression of focus. Hyperarticulated speech tends to be accented and emphatic, and requires more articulatory effort. We apply two cost functions to try to force the selection of hyperarticulated speech. The first operates on the duration of units in the unit selection database, the second on the language redundancy (word trigram predictability) of the word containing the unit. We estimate their relative importance in selecting hyperarticulated speech in unit selection speech synthesis. In a listening test, these cost functions were applied to one random content word in a Haskins anomalous sentence, and listeners were asked to select the two clearest and most focused words from the sentence. The duration-increasing cost function was significantly related to an increase in perceived prominence, whereas the low-redundancy cost function and a combination of both approaches did not produce significant results. Thus, although a significant correlation exists between the average duration and redundancy of diphones and perceived prominence, this correlation did not translate smoothly into an error-free method for altering perceived prominence.
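The two kinds of bias described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the form of either cost function, so the `Unit` fields, the z-score duration cost, the use of a trigram log-probability as the redundancy cost, and the weights `w_dur` and `w_red` are all assumptions made for illustration. The sketch also ignores join costs and the usual Viterbi search over a whole utterance, and simply picks the cheapest candidate for a single target unit.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A candidate diphone from a hypothetical unit selection database."""
    name: str
    duration: float         # duration of this instance, in seconds
    mean_duration: float    # mean duration of this diphone type (assumed known)
    sd_duration: float      # standard deviation for this diphone type (assumed known)
    trigram_logprob: float  # log10 P(word | two preceding words) of the source word


def duration_cost(u: Unit) -> float:
    """Assumed duration cost: cheaper for units longer than typical for their type."""
    z = (u.duration - u.mean_duration) / u.sd_duration
    return -z


def redundancy_cost(u: Unit) -> float:
    """Assumed redundancy cost: cheaper for units from less predictable words.

    A more negative log-probability means a less predictable (lower redundancy)
    word, so the log-probability itself serves as the cost.
    """
    return u.trigram_logprob


def focus_cost(u: Unit, w_dur: float = 1.0, w_red: float = 0.0) -> float:
    """Weighted sum of the two illustrative costs."""
    return w_dur * duration_cost(u) + w_red * redundancy_cost(u)


def select(candidates: list[Unit], w_dur: float = 1.0, w_red: float = 0.0) -> Unit:
    """Pick the candidate with the lowest focus cost (no join cost, no search)."""
    return min(candidates, key=lambda u: focus_cost(u, w_dur, w_red))


# Three invented candidates for the same diphone type.
candidates = [
    Unit("short", duration=0.08, mean_duration=0.10, sd_duration=0.02, trigram_logprob=-1.0),
    Unit("long", duration=0.14, mean_duration=0.10, sd_duration=0.02, trigram_logprob=-0.5),
    Unit("rare", duration=0.10, mean_duration=0.10, sd_duration=0.02, trigram_logprob=-4.0),
]

by_duration = select(candidates, w_dur=1.0, w_red=0.0)
by_redundancy = select(candidates, w_dur=0.0, w_red=1.0)
```

With these invented values, the duration-only setting prefers the candidate that is long for its type, and the redundancy-only setting prefers the candidate taken from the least predictable word. In a full synthesiser, costs of this kind would be added to the existing target and join costs rather than used alone.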


Bibliographic reference. Aylett, Matthew P. (2005): "Synthesising hyperarticulation in unit selection TTS", in INTERSPEECH-2005, 2521-2524.