From 3-d speaker cloning to text-to-audiovisual-speech

Sascha Fagel, Frédéric Elisei, Gérard Bailly

Visible speech movements were motion-captured and parameterized. Coarticulated targets were extracted from vowel-consonant-vowel (VCV) sequences and modeled to generate arbitrary German utterances by target interpolation. The system was extended to synthesize English utterances by mapping English phonemes to German phonemes. An evaluation using a modified rhyme test shows that adding the synthetic videos of isolated words to audio-only presentation increases recognition scores from 27% to 47.5%.
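As a rough illustration of target interpolation combined with a phoneme mapping, the following minimal sketch may help. It is not the authors' implementation. The two-parameter targets, the `EN_TO_DE` table, and the function names are hypothetical, and linear interpolation is an assumption, since the abstract does not specify the interpolation scheme. The targets here are also context-independent, whereas the paper uses coarticulated (context-dependent) targets.

```python
from typing import Dict, List, Sequence

# Hypothetical articulatory targets: one parameter vector per phoneme,
# e.g. (jaw opening, lip rounding). Values are illustrative only.
TARGETS: Dict[str, List[float]] = {
    "m": [0.0, 0.0],
    "a": [1.0, 0.0],
    "u": [0.2, 1.0],
}

# Hypothetical mapping from an English phoneme to a German stand-in.
EN_TO_DE: Dict[str, str] = {"w": "u"}


def map_phonemes(phonemes: Sequence[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each phoneme found in the mapping; keep all others unchanged."""
    return [mapping.get(p, p) for p in phonemes]


def interpolate_targets(
    phonemes: Sequence[str],
    targets: Dict[str, List[float]],
    steps: int,
) -> List[List[float]]:
    """Linearly interpolate between consecutive phoneme targets.

    Emits `steps` frames per transition, then the final target itself.
    """
    if not phonemes:
        return []
    frames: List[List[float]] = []
    for cur, nxt in zip(phonemes, phonemes[1:]):
        start, end = targets[cur], targets[nxt]
        for i in range(steps):
            t = i / steps  # fraction of the way from start to end
            frames.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    frames.append(list(targets[phonemes[-1]]))
    return frames


if __name__ == "__main__":
    # An English-like input is first mapped to the German inventory,
    # then rendered as a parameter trajectory.
    sequence = map_phonemes(["w", "a"], EN_TO_DE)
    for frame in interpolate_targets(sequence, TARGETS, steps=4):
        print(frame)
```

In this sketch the mapping step runs first, so the interpolation only ever sees phonemes for which targets exist.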

Cite as: Fagel, S., Elisei, F., Bailly, G. (2008) From 3-d speaker cloning to text-to-audiovisual-speech. Proc. Interspeech 2008, 2325

