ISCA Archive SSW 2023

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Eva Szekely, Gustav Eje Henter

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.


doi: 10.21437/SSW.2023-24

Cite as: Mehta, S., Wang, S., Alexanderson, S., Beskow, J., Szekely, E., Henter, G.E. (2023) Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 150-156, doi: 10.21437/SSW.2023-24

@inproceedings{mehta23_ssw,
  author={Shivam Mehta and Siyang Wang and Simon Alexanderson and Jonas Beskow and Eva Szekely and Gustav Eje Henter},
  title={{Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={150--156},
  doi={10.21437/SSW.2023-24}
}