Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Data-Driven Synthesis of Expressive Visual Speech Using an MPEG-4 Talking Head

Jonas Beskow, Mikael Nordenberg

KTH, Stockholm, Sweden

This paper describes initial experiments with synthesis of visual speech articulation for different emotions, using a newly developed MPEG-4 compatible talking head. The basic problem in combining speech and emotion in a talking head is handling the interaction between emotional expression and articulation in the orofacial region. Rather than modelling speech and emotion as two separate properties, we incorporate emotional expression into the articulation from the beginning. We use a data-driven approach, training the system to recreate the expressive articulation produced by an actor portraying different emotions. Each emotion is modelled separately using principal component analysis and a parametric coarticulation model. The results so far are encouraging, but more work is needed to improve the naturalness and accuracy of the synthesized speech.
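As a rough illustration of what "each emotion is modelled separately using principal component analysis" could look like, the sketch below fits one small PCA model per emotion label. It is not the authors' implementation: the function names, the toy two-dimensional frames, and the choice to keep only the first principal component (found by power iteration) are all assumptions made for the example. The parametric coarticulation model mentioned in the abstract is not covered here.

```python
import math


def fit_first_component(frames, iterations=100):
    """Return (mean, axis): the mean frame and the first principal axis.

    `frames` is a list of equal-length lists of floats, standing in for
    per-frame facial parameter vectors (a hypothetical data layout).
    """
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    centered = [[f[j] - mean[j] for j in range(dim)] for f in frames]
    # Sample covariance matrix of the centered frames.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly apply the covariance matrix and normalize.
    # The start vector is arbitrary; it must not be orthogonal to the answer.
    axis = [1.0 / (j + 1) for j in range(dim)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break  # degenerate data with no variance
        axis = [v / norm for v in nxt]
    return mean, axis


def project(frame, mean, axis):
    """Score of a frame along the principal axis of one emotion model."""
    return sum((x - m) * a for x, m, a in zip(frame, mean, axis))


def reconstruct(score, mean, axis):
    """Map a principal-component score back to a parameter vector."""
    return [m + score * a for m, a in zip(mean, axis)]


def fit_per_emotion(recordings):
    """Fit an independent PCA model for every emotion label."""
    return {emotion: fit_first_component(frames)
            for emotion, frames in recordings.items()}


# Toy data (invented for illustration): two parameters per frame.
recordings = {
    "happy": [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]],
    "sad": [[0.0, 3.0], [1.0, 2.0], [2.0, 1.0], [3.0, 0.0]],
}
models = fit_per_emotion(recordings)
```

Because each emotion gets its own mean and axis, the same score maps to different parameter vectors depending on which model is used for reconstruction, which is one simple way to keep the emotional expression inside the articulation model rather than layering it on afterwards.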


Bibliographic reference. Beskow, Jonas / Nordenberg, Mikael (2005): "Data-driven synthesis of expressive visual speech using an MPEG-4 talking head", in INTERSPEECH-2005, 793-796.