Incremental dialogue systems can produce fast responses and can interact in a human-like fashion. However, these systems occasionally produce erroneous material or run out of things to say. Humans in such situations use disfluencies to remedy their ongoing production and signal this to the listener. We devised a new model for inserting disfluencies into synthesis and evaluated this approach in a perception test. It showed that lengthenings and silent pauses can be built for speech synthesis with low effort and high output quality. Synthesized word fragments and filled pauses, while potentially useful in incremental dialogue systems, appear more difficult to handle for listeners. While we were able to get consistently high ratings for certain types of disfluencies, the need for more basic research on their micro structure became apparent in order to be able to synthesize the fine phonetic detail of disfluencies. For this, we analysed corpus data with regard to distributional and durational aspects of lengthenings, word fragments and pauses. Based on these natural speaking strategies, we explored further to what extent speech can be delayed using disfluency strategies, and how to handle difficult disfluency elements by determining the appropriate amount of durational variation applicable.
Bibliographic reference. Betz, Simon / Wagner, Petra / Schlangen, David (2015): "Micro-structure of disfluencies: basics for conversational speech synthesis", In INTERSPEECH-2015, 2222-2226.