Auditory-Visual Speech Processing
(AVSP 2001)

September 7-9, 2001
Aalborg, Denmark

Towards the Facecoder: Dynamic Face Synthesis Based on Image Motion Estimation in Speech

Christian Kroos, Saeko Masuda, Takaaki Kuratate, Eric Vatikiotis-Bateson

ATR International, Information Sciences Division, Seika-cho, Soraku-gun, Kyoto, Japan

The (digital) transmission of talking faces requires a high bandwidth that not every target channel is able to provide, even if powerful image compression algorithms are used. Therefore, a special face coding algorithm would be highly desirable. Unfortunately, development of such an algorithm has been hindered by the general problem of image motion estimation. In this paper we present a video-based system for face motion processing similar to the well-known voder-vocoder system for processing and coding acoustic speech signals. Like the vocoder, our 'face coder' consists of two independent parts: an analysis part for tracking non-rigid face motion, and a synthesis part for producing face animations. Results are shown for face motion tracking and the subsequent animation derived from either the raw motion data or the outcome of Principal Component Analysis. The automatic tracking results were evaluated by comparison with a set of manually tracked points.
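The abstract's synthesis from "the outcome of Principal Component Analysis" amounts to projecting the tracked per-frame motion onto a few principal components and reconstructing the motion from those retained components. A minimal sketch of that step, assuming the tracked motion is stored as a frames-by-coordinates matrix (all function and variable names here are illustrative, not taken from the paper):

```python
import numpy as np

def pca_reduce_motion(motion, n_components):
    """Project tracked face motion (frames x coordinates) onto its
    first n_components principal components and reconstruct an
    approximation of the motion from them."""
    mean = motion.mean(axis=0)
    centered = motion - mean
    # SVD of the centered data matrix; rows of Vt are principal directions
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:n_components].T       # per-frame PC weights
    approx = scores @ Vt[:n_components] + mean    # reconstructed motion
    return scores, approx

# Toy example: 100 frames of 6 coordinates (3 mesh nodes in 2-D),
# constructed to have rank 2 so two components suffice
rng = np.random.default_rng(0)
motion = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 6))
scores, approx = pca_reduce_motion(motion, n_components=2)
print(np.allclose(motion, approx))  # rank-2 data is recovered by 2 PCs
```

Driving the animation from `scores` instead of the raw coordinates is what distinguishes the PCA-based synthesis movie from the raw-data synthesis movie listed below.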


Full Paper

Bibliographic reference.  Kroos, Christian / Masuda, Saeko / Kuratate, Takaaki / Vatikiotis-Bateson, Eric (2001): "Towards the facecoder: dynamic face synthesis based on image motion estimation in speech", In AVSP-2001, 24-29.

Multimedia Files

Link | Original Filename | Description | Format
av01_024_1.mov (8450 KB) | fm_trak.mov | This movie shows the motion tracking results for CID sentence number three, 'Our janitor sweeps the floors every night'. The mesh is superimposed on the original video sequence, converted to the gray-scale images used for the tracking. | QuickTime 4 for Windows
av01_024_2.mov (3082 KB) | fsyn_raw.mov | This movie shows the synthesis based on the raw motion tracking results, with the original image on the right-hand side and the synthesized face on the left-hand side. Note that no post-processing whatsoever, not even temporal low-pass filtering, was applied, so the synthesis strictly reflects the raw motion tracking data. | QuickTime 4 for Windows
av01_024_3.mov (3140 KB) | fsyn_pca.mov | This movie shows the synthesis based on a subset of retained principal components, presented in the same way as above. | QuickTime 4 for Windows