Predicting Head Pose from Speech with a Conditional Variational Autoencoder

David Greenwood, Stephen Laycock, Iain Matthews


Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution that visual cues make to the degree to which we, as human observers, find an animation acceptable.

Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek a transformation from the speech mode that predicts the head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose.

Recently, Long Short-Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ deep bidirectional LSTMs (BLSTMs), which are capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning on prior motion. Finally, we introduce a generative head motion model, conditioned on audio features, using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.
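
As a rough illustration of why a conditional generative model suits a one-to-many mapping, the sketch below shows the two standard CVAE ingredients (the Gaussian KL term and the reparameterised latent draw) and then samples several head poses for a single audio input. It is a toy under stated assumptions, not the model from the paper: the standard normal prior, the linear `toy_decoder`, the weight values and all function names are hypothetical stand-ins for a trained network.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def toy_decoder(audio, z, weights):
    """Hypothetical linear decoder mapping [audio; z] to three pose angles.

    Assumption: a real system would use a trained network here.
    """
    inputs = list(audio) + list(z)
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def sample_poses(audio, weights, latent_dim, n_samples, seed=0):
    """Sample several plausible poses for the same audio features.

    Each draw of z from the assumed N(0, I) prior gives a different pose,
    which is how the model represents the one-to-many mapping.
    """
    rng = random.Random(seed)
    poses = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        poses.append(toy_decoder(audio, z, weights))
    return poses


# Illustrative values only: 2 audio features, 2 latent dimensions.
audio_features = [0.4, -0.2]
decoder_weights = [
    [0.1, 0.2, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.3, 0.0, 0.5, 0.5],
]
poses = sample_poses(audio_features, decoder_weights, latent_dim=2, n_samples=3)
```

During training, the KL term would be added to a reconstruction loss; at inference, only the sampling path is needed, so the same audio features can yield a distribution over head poses rather than a single averaged prediction.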


DOI: 10.21437/Interspeech.2017-894

Cite as: Greenwood, D., Laycock, S., Matthews, I. (2017) Predicting Head Pose from Speech with a Conditional Variational Autoencoder. Proc. Interspeech 2017, 3991-3995, DOI: 10.21437/Interspeech.2017-894.


@inproceedings{Greenwood2017,
  author={David Greenwood and Stephen Laycock and Iain Matthews},
  title={Predicting Head Pose from Speech with a Conditional Variational Autoencoder},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3991--3995},
  doi={10.21437/Interspeech.2017-894},
  url={http://dx.doi.org/10.21437/Interspeech.2017-894}
}