Auditory-Visual Speech Processing
(AVSP 2001)

September 7-9, 2001
Aalborg, Denmark

Visual Speech Synthesis Using Statistical Models of Shape and Appearance

Barry J. Theobald (1), J. Andrew Bangham (1), Iain Matthews (2), Gavin C. Cawley (1)

(1) School of Information Systems, University of East Anglia, Norwich, NR4 7TJ, UK.
(2) Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA.

In this paper we present preliminary results of work towards a video-realistic visual speech synthesizer based on statistical models of shape and appearance. A sequence of images corresponding to an utterance is formed by concatenation of synthesis units (in this case triphones) from a pre-recorded inventory. Initial work has concentrated on a compact representation of human faces, accommodating an extensive visual speech corpus without incurring excessive storage costs. The minimal set of control parameters of a combined appearance model is selected according to formal subjective testing. We also present two methods used to build statistical models that account for the perceptually important regions of the face.
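
As an illustration of the concatenative step described above, the following minimal sketch maps a phone sequence to triphone units and joins the stored parameter trajectories of those units into one sequence. It is not the authors' implementation: the names `to_triphones`, `synthesize` and `inventory`, the `sil` padding symbol, and the idea of storing each unit as a list of appearance-model parameter vectors are all assumptions made for the example. Unit selection among multiple candidates and smoothing at the joins are omitted, since the abstract does not describe them.

```python
from typing import Dict, List, Sequence, Tuple

Triphone = Tuple[str, str, str]
# Hypothetical representation: one vector of model parameters per video frame.
Frame = List[float]


def to_triphones(phones: Sequence[str], pad: str = "sil") -> List[Triphone]:
    """Return one (left, centre, right) context triple per phone.

    The utterance is padded with an assumed silence symbol so that the
    first and last phones also have a left and right context.
    """
    padded = [pad, *phones, pad]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def synthesize(
    phones: Sequence[str], inventory: Dict[Triphone, List[Frame]]
) -> List[Frame]:
    """Concatenate the stored frames of each triphone, in utterance order."""
    trajectory: List[Frame] = []
    for unit in to_triphones(phones):
        if unit not in inventory:
            # A real system would need a back-off strategy here; this
            # sketch simply reports the missing unit.
            raise KeyError(f"triphone {unit} not in inventory")
        trajectory.extend(inventory[unit])
    return trajectory


# Toy inventory with made-up two-dimensional parameter vectors.
toy_inventory: Dict[Triphone, List[Frame]] = {
    ("sil", "b", "a"): [[0.0, 0.1], [0.2, 0.1]],
    ("b", "a", "t"): [[0.5, 0.3]],
    ("a", "t", "sil"): [[0.1, 0.0], [0.0, 0.0]],
}

frames = synthesize(["b", "a", "t"], toy_inventory)
```

Each entry of `frames` would then be passed to the appearance model to render one image. Storing low-dimensional parameter vectors rather than raw images is what keeps the storage cost of a large inventory manageable.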


Bibliographic reference. Theobald, Barry J. / Bangham, J. Andrew / Matthews, Iain / Cawley, Gavin C. (2001): "Visual speech synthesis using statistical models of shape and appearance", in AVSP-2001, 78-83.