Auditory-Visual Speech Processing
(AVSP 2001)

September 7-9, 2001
Aalborg, Denmark

Creating and Controlling Video-Realistic Talking Heads

F. Elisei, M. Odisio, G. Bailly, P. Badin

Institut de la Communication Parlée, Grenoble, France

We present a linear three-dimensional modeling paradigm for the lips and face that captures the audiovisual speech activity of a given speaker with only six parameters. Our articulatory models are built from real data (front and profile images) using a linear component analysis of the 3D coordinates of about 200 fleshpoints on the subject's face and lips. Compared to a raw component analysis, this construction yields parameters that are more directly comparable across subjects: by construction, the six parameters have a clear phonetic/articulatory interpretation. We use such a speaker-specific articulatory model to regularize MPEG-4 facial animation parameters (FAPs) and show that this regularization can drastically reduce bandwidth requirements, noise, and quantization artifacts. We then show how analysis-by-synthesis techniques based on the speaker-specific model allow facial movements to be tracked. Finally, the results of this tracking scheme have been used to develop a text-to-audiovisual speech system.
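As a rough illustration of the linear-model idea described above, the short NumPy sketch below (not the authors' code; the array shapes, the random stand-in basis, and the function names are assumptions made for illustration) synthesizes a 3D fleshpoint configuration from six articulatory parameters and then regularizes a noisy measurement by least-squares projection onto the six-dimensional model subspace. This projection is the mechanism that lets noise and quantization artifacts be discarded while only six numbers per frame are kept.

import numpy as np

N_POINTS = 200   # fleshpoints on the face and lips (approximate, per the abstract)
N_PARAMS = 6     # per the multimedia files: jaw height, lip width/protrusion,
                 # two further lip parameters, jaw advance, larynx residual

rng = np.random.default_rng(0)

# A learned model would provide a mean shape and one 3D deformation basis per
# articulatory parameter; random stand-ins are used here for illustration only.
mean_shape = rng.normal(size=3 * N_POINTS)            # (600,) stacked x/y/z
basis = rng.normal(size=(3 * N_POINTS, N_PARAMS))     # (600, 6)

def synthesize(params):
    """Linear model: 3D fleshpoint coordinates for a 6-vector of parameters."""
    return mean_shape + basis @ params

def regularize(noisy_shape):
    """Project a noisy/quantized shape onto the 6-D model subspace.

    Solving min_p ||mean_shape + basis @ p - x||^2 in the least-squares sense
    recovers the six parameters; re-synthesizing from them discards whatever
    the articulatory model cannot produce (measurement noise, quantization
    artifacts), and only six numbers per frame need be stored or transmitted.
    """
    params, *_ = np.linalg.lstsq(basis, noisy_shape - mean_shape, rcond=None)
    return synthesize(params)

# Toy usage: corrupt a synthetic articulation with noise, then regularize it.
true_params = np.array([0.8, -0.3, 0.1, 0.5, 0.0, -0.2])
clean = synthesize(true_params)
noisy = clean + rng.normal(scale=0.05, size=clean.shape)
cleaned = regularize(noisy)
print("error before regularization:", np.linalg.norm(noisy - clean))
print("error after regularization: ", np.linalg.norm(cleaned - clean))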

Bibliographic reference. Elisei, F. / Odisio, M. / Bailly, G. / Badin, P. (2001): "Creating and controlling video-realistic talking heads", in AVSP-2001, 90-97.

Multimedia Files

Link | Original Filename | Description | Format
av01_090_01.avi (1183 KB) | NomoJ1.avi | First jaw-driven articulator (height) | Video File - AVI
av01_090_02.avi (1182 KB) | NomoL1.avi | First lips-driven articulator (width/protrusion) | Video File - AVI
av01_090_03.avi (1185 KB) | NomoL2.avi | Second lips-driven articulator (lower lip) | Video File - AVI
av01_090_04.avi (1185 KB) | NomoL3.avi | Second lips-driven articulator (lower lip) | Video File - AVI
av01_090_05.avi (1184 KB) | NomoJ2.avi | Second jaw articulator (advance) | Video File - AVI
av01_090_06.avi (1182 KB) | NomoL1.avi | Residual articulator (larynx skin) | Video File - AVI
av01_090_07.avi (1775 KB) | fap_capuchon.avi | Playing the same FAP stream on 2 different clones | Video File - AVI
av01_090_08.avi (1052 KB) | aga_half.avi | Side by side: analysis-by-synthesis inversion, half superimposed on the tracked video (learning conditions) | Video File - AVI
av01_090_09.avi (1984 KB) | salam_vid.avi | Example of reconstruction/tracking in learning conditions | Video File - AVI
av01_090_10.avi (21566 KB) | bise.avi | Reconstruction of a long sequence (tracked in learning conditions), with recovered jaw movements | Video File - AVI
av01_090_11.avi (1925 KB) | capuchon.avi | Tracking in natural conditions: superimposing the resulting articulations through the wire-frame model | Video File - AVI
av01_090_12.avi (3493 KB) | jaw_recover.avi | Showing the recovered jaw movements (learning conditions) | Video File - AVI
av01_090_13.avi (1456 KB) | jaw_recover_02.avi | Showing the recovered jaw and 3D movements from a single front-only view (natural conditions) | Video File - AVI
av01_090_14.avi (20502 KB) | tts_icp_fr.avi | Output of our text-to-audiovisual speech system | Video File - AVI
av01_090_15.avi (1703 KB) | 3D_tongue.avi | The ICP 3D tongue (linked with the jaw parameters) | Video File - AVI