With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.
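To make the idea of probabilistic joint synthesis concrete, the sketch below shows a generic reverse-diffusion sampling loop over a concatenated speech-plus-gesture feature sequence. It is a minimal illustration only: the score network, feature dimensions, noise schedule, and update rule are assumptions for exposition, not the authors' actual Diff-TTSG implementation.

```python
# Hypothetical sketch: reverse-diffusion sampling over a *joint* feature space,
# so mel-spectrogram (speech) and pose (gesture) channels are denoised together.
# All names, sizes, and the schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

T_FRAMES = 100   # number of synthesis frames (assumed)
N_MEL = 80       # mel-spectrogram channels (assumed)
N_POSE = 45      # gesture pose features, e.g. joint rotations (assumed)
N_STEPS = 50     # reverse-diffusion steps (assumed)


def score_model(x_t, t, conditioning):
    """Stand-in for a learned score network s_theta(x_t, t | text).
    Here it simply pulls the sample towards the conditioning features."""
    return conditioning - x_t


# Text-derived conditioning aligned to frames (random placeholder here).
cond = rng.standard_normal((T_FRAMES, N_MEL + N_POSE))

# Start the reverse process from Gaussian noise over the joint feature space.
x = rng.standard_normal((T_FRAMES, N_MEL + N_POSE))

betas = np.linspace(1e-4, 0.05, N_STEPS)  # toy noise schedule (assumed)
for i in reversed(range(N_STEPS)):
    t = (i + 1) / N_STEPS
    score = score_model(x, t, cond)
    noise = rng.standard_normal(x.shape) if i > 0 else 0.0
    # Euler-Maruyama-style reverse update: drift towards the data
    # plus scaled noise; the final step is deterministic.
    x = x + betas[i] * score + np.sqrt(betas[i]) * noise

mel, pose = x[:, :N_MEL], x[:, N_MEL:]
print(mel.shape, pose.shape)  # (100, 80) (100, 45)
```

Because both modalities are sampled from one probabilistic model rather than regressed deterministically, repeated runs yield varied but coherent speech and motion, which is the property the deterministic prior art lacks.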
Cite as: Mehta, S., Wang, S., Alexanderson, S., Beskow, J., Szekely, E., Henter, G.E. (2023) Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 150-156, doi: 10.21437/SSW.2023-24
@inproceedings{mehta23_ssw,
  author={Shivam Mehta and Siyang Wang and Simon Alexanderson and Jonas Beskow and Eva Szekely and Gustav Eje Henter},
  title={{Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={150--156},
  doi={10.21437/SSW.2023-24}
}