Auditory-Visual Speech Processing (AVSP) 2013

Annecy, France
August 29 - September 1, 2013

Objective and Subjective Feature Evaluation for Speaker-Adaptive Visual Speech Synthesis

Dietmar Schabus (1,2), Michael Pucher (1), Gregor Hofer (1)

(1) Telecommunications Research Center Vienna (FTW), Vienna, Austria
(2) Graz University of Technology, Graz, Austria

This paper describes an evaluation of a feature extraction method for visual speech synthesis that is suitable for speaker-adaptive training of a Hidden Semi-Markov Model (HSMM)-based visual speech synthesizer. An audio-visual corpus of three speakers was recorded. While the features used for the auditory modality are well understood, we propose to use a standard Principal Component Analysis (PCA) approach to extract suitable features for training and synthesis of the visual modality. The PCA-based approach provides dimensionality reduction and component de-correlation of the 3D facial marker data, which was recorded using a facial motion capture system. Enabling visual average “voice” training and speaker adaptation brings a key strength of the HMM framework into both the visual and the audio-visual domain. An objective evaluation based on reconstruction error calculations, as well as a perceptual evaluation with 40 test subjects, shows that PCA is well suited for feature extraction from multiple speakers, even in a challenging adaptation scenario where no data from the target speaker is available when the PCA basis is computed.
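As a rough illustration of the approach the abstract describes, the sketch below shows PCA feature extraction from flattened 3D marker frames and the kind of reconstruction-error calculation an objective evaluation could be based on. It is not the authors' implementation: the marker count, the number of retained components k, the use of NumPy, the random placeholder data, and RMSE as the error measure are all assumptions made for the example.

    import numpy as np

    def pca_basis(X, k):
        """Compute a k-dimensional PCA basis from frames X (n_frames x 3*n_markers)."""
        mean = X.mean(axis=0)
        Xc = X - mean
        # SVD of the centered data yields the principal components in Vt.
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return mean, Vt[:k]            # mean marker configuration and top-k components

    def project(X, mean, W):
        return (X - mean) @ W.T        # low-dimensional visual features

    def reconstruct(Z, mean, W):
        return Z @ W + mean            # back to 3D marker coordinates

    # Illustrative usage: 1000 frames of 40 markers (x, y, z flattened to 120 dims).
    X = np.random.randn(1000, 120)
    mean, W = pca_basis(X, k=20)
    Z = project(X, mean, W)
    X_hat = reconstruct(Z, mean, W)
    rmse = np.sqrt(np.mean((X - X_hat) ** 2))   # objective reconstruction error
    print(f"RMSE: {rmse:.4f}")

In the adaptation scenario the paper evaluates, the basis (mean, W) would be computed from the non-target speakers' frames only, and the target speaker's frames would then be projected onto and reconstructed from that basis.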

Bibliographic reference. Schabus, Dietmar / Pucher, Michael / Hofer, Gregor (2013): "Objective and subjective feature evaluation for speaker-adaptive visual speech synthesis", In AVSP-2013, 37-42.