Altering speech synthesis prosody through real time natural gestural control

David Abelman, Robert Clark


This paper investigates the use of natural gestural controls to alter synthesised speech prosody in real time (for example, recognising a one-handed beat as a cue to emphasise a certain word in a synthesised sentence). A user's gestures are recognised using a Microsoft Kinect sensor, and the synthesised speech prosody is altered through a set of hand-crafted rules implemented in a modified HTS engine (pHTS, developed at the Université de Mons). Two sets of preliminary experiments are carried out. Firstly, it is shown that users can control the device with a moderate level of accuracy, which is projected to improve further as the system is refined. Secondly, it is shown that the prosody of the altered output is significantly preferred to that of the baseline pHTS synthesis. Future work is recommended to focus on learning gestural and prosodic rules from data, and on using an updated version of the underlying pHTS engine. The reader is encouraged to watch a short video demonstration of the work at http://tinyurl.com/gesture-prosody.
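The abstract describes mapping recognised gestures to prosody changes via hand-crafted rules. A minimal sketch of that idea is given below; all names, gestures, and scaling values are illustrative assumptions, not the authors' actual rules or the pHTS API.

```python
# Hypothetical sketch of the gesture-to-prosody rule idea from the abstract:
# a recognised gesture selects a prosodic modification applied to the word
# currently being synthesised. Values and names are illustrative only.

from dataclasses import dataclass

@dataclass
class ProsodyMod:
    f0_scale: float        # multiply the word's F0 contour (raise/lower pitch)
    duration_scale: float  # stretch or compress the word's duration

# One hand-crafted rule per recognised gesture (assumed gesture labels).
RULES = {
    "one_hand_beat": ProsodyMod(f0_scale=1.3, duration_scale=1.2),  # emphasis
    "palm_down":     ProsodyMod(f0_scale=0.8, duration_scale=1.0),  # de-emphasis
}

def apply_gesture(gesture, f0_contour, duration):
    """Return a modified (f0_contour, duration) pair for the current word."""
    mod = RULES.get(gesture)
    if mod is None:
        # No recognised gesture: leave the baseline prosody unchanged.
        return f0_contour, duration
    return ([f0 * mod.f0_scale for f0 in f0_contour],
            duration * mod.duration_scale)
```

In a real-time setting, such a lookup would run per synthesis frame or per word, with the Kinect-derived gesture label driving the rule selection.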


DOI: 10.21437/SpeechProsody.2014-182

Cite as: Abelman, D., Clark, R. (2014) Altering speech synthesis prosody through real time natural gestural control. Proc. 7th International Conference on Speech Prosody 2014, 969-973, DOI: 10.21437/SpeechProsody.2014-182.


@inproceedings{Abelman2014,
  author={David Abelman and Robert Clark},
  title={{Altering speech synthesis prosody through real time natural gestural control}},
  year=2014,
  booktitle={Proc. 7th International Conference on Speech Prosody 2014},
  pages={969--973},
  doi={10.21437/SpeechProsody.2014-182},
  url={http://dx.doi.org/10.21437/SpeechProsody.2014-182}
}