9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

From 3-D Speaker Cloning to Text-to-Audiovisual-Speech

Sascha Fagel (1), Frédéric Elisei (2), Gérard Bailly (2)

(1) Technische Universität Berlin, Germany; (2) GIPSA, France

Visible speech movements were motion-captured and parameterized. Coarticulated targets were extracted from vowel-consonant-vowel sequences (VCVs) and modeled to generate arbitrary German utterances by target interpolation. The system was extended to synthesize English utterances by mapping English phonemes to German phonemes. An evaluation using a modified rhyme test shows that adding the synthetic videos of isolated words to audio-only presentation increases recognition scores from 27% to 47.5%.
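The two steps named in the abstract, phoneme mapping and target interpolation, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the phoneme symbols, the `EN_TO_DE` map, the two-parameter targets in `TARGETS` (jaw opening, lip rounding), and the use of plain linear interpolation, which ignores the coarticulation modeling the authors describe.

```python
from typing import Dict, List, Tuple

# Hypothetical English-to-German phoneme substitutions (illustrative only).
EN_TO_DE: Dict[str, str] = {"w": "v", "th": "s", "ae": "E"}

# Hypothetical articulatory targets: phoneme -> (jaw opening, lip rounding).
TARGETS: Dict[str, Tuple[float, float]] = {
    "a": (1.0, 0.0),
    "E": (0.6, 0.0),
    "u": (0.2, 1.0),
    "m": (0.0, 0.2),
    "v": (0.1, 0.3),
    "s": (0.1, 0.0),
}


def map_to_german(phonemes: List[str]) -> List[str]:
    """Replace each phoneme that has a substitution; keep the others as-is."""
    return [EN_TO_DE.get(p, p) for p in phonemes]


def interpolate_targets(
    phonemes: List[str], times: List[float], t: float
) -> Tuple[float, ...]:
    """Linearly interpolate articulatory parameters at time t.

    times[i] is the instant at which phonemes[i] reaches its target;
    times must be strictly increasing and as long as phonemes.
    """
    if len(phonemes) != len(times) or not phonemes:
        raise ValueError("phonemes and times must be non-empty and equal length")
    # Hold the first or last target outside the covered time span.
    if t <= times[0]:
        return TARGETS[phonemes[0]]
    if t >= times[-1]:
        return TARGETS[phonemes[-1]]
    # Find the pair of neighboring targets that brackets t.
    for i in range(len(times) - 1):
        if times[i] <= t < times[i + 1]:
            alpha = (t - times[i]) / (times[i + 1] - times[i])
            start = TARGETS[phonemes[i]]
            end = TARGETS[phonemes[i + 1]]
            return tuple(a + alpha * (b - a) for a, b in zip(start, end))
    raise ValueError("times must be strictly increasing")
```

For example, `interpolate_targets(map_to_german(["w", "a"]), [0.0, 0.1], 0.05)` returns the parameters halfway between the assumed targets for "v" and "a". A real system would use many more parameters per frame and targets that depend on phonetic context.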


Bibliographic reference. Fagel, Sascha / Elisei, Frédéric / Bailly, Gérard (2008): "From 3-D speaker cloning to text-to-audiovisual-speech", In INTERSPEECH-2008, 2325.