How to train your fillers: uh and um in spontaneous speech synthesis

Éva Székely, Gustav Eje Henter, Jonas Beskow, Joakim Gustafson


Using spontaneous conversational speech for TTS raises questions about how disfluencies such as filled pauses (FPs) should be approached. Detailed annotation of FPs in training data enables precise control at synthesis time; coarse or nonexistent FP annotation, when combined with stochastic attention-based neural TTS, leads to synthesisers that insert these phenomena into fluent prompts of their own accord. In this study we investigate, objectively and subjectively, the effects of FP annotation and the impact of relinquishing control over FPs in a Tacotron TTS system. The training corpus comprised 9 hours of single-speaker breath groups extracted from a conversational podcast. Systems trained with no or location-only FP annotation were found to reproduce FP locations and types (uh/um) in a pattern broadly similar to that of the corpus. We also studied the effect of FPs on natural and synthetic speech rate and the interchangeability of FP types. Interestingly, subjective tests indicate that synthesiser-predicted FP types from location-only annotation were often preferred over specifying the ground-truth type. In contrast, a more precise annotation, allowing us to focus training on the most fluent parts of the corpus, improved rated naturalness when synthesising fluent speech.


DOI: 10.21437/SSW.2019-44

Cite as: Székely, É., Eje Henter, G., Beskow, J., Gustafson, J. (2019) How to train your fillers: uh and um in spontaneous speech synthesis. Proc. 10th ISCA Speech Synthesis Workshop, 245-250, DOI: 10.21437/SSW.2019-44.


@inproceedings{Székely2019,
  author={Éva Székely and Gustav {Eje Henter} and Jonas Beskow and Joakim Gustafson},
  title={{How to train your fillers: uh and um in spontaneous speech synthesis}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={245--250},
  doi={10.21437/SSW.2019-44},
  url={http://dx.doi.org/10.21437/SSW.2019-44}
}