INTERSPEECH 2004 - ICSLP
We have previously presented a system that tracks the 3D speech movements of a speaker's face in a monocular video sequence. For that purpose, speaker-specific models of the face were built, comprising a 3D shape model and several appearance models. In this paper, speech movements estimated with this system are evaluated perceptually. The movements are re-synthesised using a Point-Light (PL) rendering and paired with the original audio signals degraded with white noise at several SNRs. We study to what extent such PL movements enhance the identification of logatoms, and how they influence the perception of incongruent audio-visual logatoms. In a first experiment, the PL rendering is evaluated per se. The results seem to confirm previous studies: though less effective than actual video, PL speech enhances intelligibility and can reproduce the McGurk effect. In a second experiment, the movements are estimated with our tracking framework using various appearance models. No salient differences are revealed between the performances of the appearance models.
Bibliographic reference. Odisio, Matthias / Bailly, Gérard (2004): "Audiovisual perceptual evaluation of resynthesised speech movements", in INTERSPEECH-2004, pp. 2029-2032.