Auditory-Visual Speech Processing (AVSP) 2013
This paper describes an evaluation of a feature extraction method for visual speech synthesis that is suitable for speaker-adaptive training of a Hidden Semi-Markov Model (HSMM)-based visual speech synthesizer. An audio-visual corpus from three speakers was recorded. While the features used for the auditory modality are well understood, we propose to use a standard Principal Component Analysis (PCA) approach to extract suitable features for training and synthesis of the visual modality. A PCA-based approach provides dimensionality reduction and component de-correlation on the 3D facial marker data, which was recorded using a facial motion capture system. Enabling visual average voice training and speaker adaptation brings a key strength of the HMM framework into both the visual and the audio-visual domain. Both an objective evaluation based on reconstruction error calculations and a perceptual evaluation with 40 test subjects show that PCA is well suited for feature extraction from multiple speakers, even in a challenging adaptation scenario where no data from the target speaker is available during PCA.
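The PCA step described above can be sketched briefly: each motion-capture frame is flattened into one vector of 3D marker coordinates, the data is centered, and projection onto the top principal components yields the low-dimensional, de-correlated visual features; projecting back gives the reconstruction whose error the objective evaluation measures. The marker count, frame count, and function names below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def pca_fit(X, k):
    """Fit PCA on rows of X (frames x features); return mean and top-k basis.

    Hypothetical sketch: the paper's actual pipeline and dimensions differ.
    """
    mean = X.mean(axis=0)
    # SVD of the centered data matrix; rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def pca_project(X, mean, basis):
    # Low-dimensional, de-correlated visual features.
    return (X - mean) @ basis.T

def pca_reconstruct(Z, mean, basis):
    # Map features back to flattened 3D marker space.
    return Z @ basis + mean

rng = np.random.default_rng(0)
n_frames, n_markers = 200, 30                    # assumed sizes
X = rng.normal(size=(n_frames, n_markers * 3))   # flattened x,y,z per frame

mean, basis = pca_fit(X, k=10)
Z = pca_project(X, mean, basis)
X_hat = pca_reconstruct(Z, mean, basis)
err = np.sqrt(np.mean((X - X_hat) ** 2))         # RMS reconstruction error
print(Z.shape, float(err))
```

Increasing `k` lowers the reconstruction error, which is the trade-off the objective evaluation in the paper quantifies.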
Bibliographic reference. Schabus, Dietmar / Pucher, Michael / Hofer, Gregor (2013): "Objective and subjective feature evaluation for speaker-adaptive visual speech synthesis", In AVSP-2013, 37-42.