International Conference on Auditory-Visual Speech Processing 2008

Tangalooma Wild Dolphin Resort, Moreton Island, Queensland, Australia
September 26-29, 2008

German Text-to-Audiovisual-Speech by 3-D Speaker Cloning

Sascha Fagel (1), Gérard Bailly (2)

(1) Berlin Institute of Technology, Germany; (2) GIPSA-lab, Grenoble, France

Visible speech movements were recorded by optical motion capture and parameterized by means of a guided PCA. Co-articulated consonantal targets were extracted from VCVs; vocalic targets were extracted from these VCVs and from sustained vowels. Targets were selected or combined to derive target sequences for the phone chains of arbitrary German utterances. Parameter trajectories for these utterances were generated by interpolating between targets with functions ranging from linear to quadratic, depending on the degree of co-articulatory influence. Videos of test words embedded in a carrier sentence were rendered from the parameter trajectories and evaluated in a rhyme test in noise. The results show that the synthetic videos, although intelligible only somewhat above chance level when played alone, significantly increase recognition scores from 45.6% in audio-alone presentation to 60.4% in audiovisual presentation.
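
The interpolation between targets can be illustrated with a minimal sketch. The abstract does not state how the degree of co-articulatory influence is mapped onto the shape of the transition, so the sketch assumes a single hypothetical parameter, exponent, that ranges from 1 (linear) to 2 (quadratic). The function names and the power-law form are illustrative assumptions, not the authors' implementation.

```python
def blend_weight(t, exponent):
    """Weight of the second target at normalized time t in [0, 1].

    exponent is a hypothetical shape parameter: 1.0 gives a linear
    transition, 2.0 a quadratic one (assumption, not from the paper).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if not 1.0 <= exponent <= 2.0:
        raise ValueError("exponent must lie in [1, 2]")
    return t ** exponent


def interpolate(start, end, n_frames, exponent=1.0):
    """Return n_frames values of one articulatory parameter,
    moving from the target value start to the target value end."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    values = []
    for i in range(n_frames):
        t = i / (n_frames - 1)          # normalized time in [0, 1]
        w = blend_weight(t, exponent)   # share of the second target
        values.append((1.0 - w) * start + w * end)
    return values


# Same pair of targets, two transition shapes.
linear = interpolate(0.0, 1.0, 5, exponent=1.0)
quadratic = interpolate(0.0, 1.0, 5, exponent=2.0)
```

With the quadratic shape the parameter stays closer to the first target for longer before moving to the second, which is one plausible way to represent a target that is strongly influenced by its neighbor.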

