We present a linear three-dimensional modeling paradigm for the lips and face that captures the audiovisual speech activity of a given speaker with only six parameters. Our articulatory models are constructed from real data (front and profile images), using a linear component analysis of about 200 3D coordinates of fleshpoints on the subject's face and lips. Compared to a raw component analysis, our construction approach yields parameters that are more directly comparable across subjects: by construction, the six parameters have a clear phonetic/articulatory interpretation. We use such a speaker-specific articulatory model to regularize MPEG-4 facial animation parameters (FAPs) and show that this regularization can drastically reduce bandwidth, noise, and quantization artifacts. We then show how analysis-by-synthesis techniques using the speaker-specific model allow facial movements to be tracked. Finally, the results of this tracking scheme have been used to develop a text-to-audiovisual speech system.
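Both the construction of the speaker-specific model and the FAP regularization amount to linear algebra on fleshpoint coordinates. The sketch below (Python/NumPy, not the authors' code) illustrates the principle with a plain principal component analysis standing in for the guided linear component analysis described in the paper; the function names, array shapes, and random training data are hypothetical.

```python
import numpy as np

def build_linear_model(frames, n_params=6):
    """Fit a linear articulatory model from training shapes.

    frames : (n_frames, 3 * n_points) array of flattened 3D fleshpoint
             coordinates, one row per recorded articulation.
    Returns the mean shape and an orthonormal basis whose n_params
    components span the retained articulatory subspace.
    """
    mean = frames.mean(axis=0)
    centered = frames - mean
    # SVD of the centered data gives the principal components
    # (a plain PCA stand-in for the guided analysis in the paper).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_params]              # (n_params, 3 * n_points)
    return mean, basis

def regularize(shape, mean, basis):
    """Project a measured shape onto the model subspace.

    A noisy or coarsely quantized frame (e.g. decoded from a FAP
    stream) is replaced by the closest shape the six-parameter
    model can produce.
    """
    params = basis @ (shape - mean)    # analysis: six control parameters
    return mean + basis.T @ params     # synthesis: regularized shape

# Hypothetical usage with stand-in data: 500 training frames,
# 200 fleshpoints with (x, y, z) coordinates each.
rng = np.random.default_rng(0)
training = rng.normal(size=(500, 600))
mean, basis = build_linear_model(training, n_params=6)
noisy_frame = training[0] + rng.normal(scale=0.05, size=600)
clean_frame = regularize(noisy_frame, mean, basis)
```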
Cite as: Elisei, F., Odisio, M., Bailly, G., Badin, P. (2001) Creating and controlling video-realistic talking heads. Proc. Auditory-Visual Speech Processing, 90-97
@inproceedings{elisei01_avsp,
  author    = {F. Elisei and M. Odisio and G. Bailly and P. Badin},
  title     = {{Creating and controlling video-realistic talking heads}},
  booktitle = {Proc. Auditory-Visual Speech Processing},
  year      = {2001},
  pages     = {90--97}
}