Auditory-Visual Speech Processing 2005
British Columbia, Canada
We have implemented a complete text-to-speech synthesis-by-concatenation system that addresses French Manual Cued Speech (FMCS). It uses two separate dictionaries: one of multimodal diphones combining audio and facial articulation, and one of the hand gestures linking two consecutive FMCS keys ("dikeys"). Both dictionaries were built from real data.
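The two-dictionary concatenation scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the unit names, dictionary contents, and frame format are all hypothetical, and real systems also smooth the joins between units.

```python
# Hypothetical sketch of synthesis by concatenation from two unit
# dictionaries (diphones for audio/face, "dikeys" for hand gestures).
# Unit names and parameter values below are invented for illustration.

def synthesize(units, dictionary):
    """Concatenate the stored parameter trajectories for a unit sequence."""
    trajectory = []
    for unit in units:
        if unit not in dictionary:
            raise KeyError(f"missing unit: {unit}")
        trajectory.extend(dictionary[unit])  # append that unit's frames
    return trajectory

# Toy dictionaries: each unit maps to a short list of parameter frames.
diphones = {"s-a": [[0.1], [0.2]], "a-l": [[0.3], [0.4]]}
dikeys = {"key1-key2": [[1.0], [1.1]]}

face_track = synthesize(["s-a", "a-l"], diphones)   # facial/audio stream
hand_track = synthesize(["key1-key2"], dikeys)      # hand-gesture stream
```

In the actual system the two streams would be time-aligned so that each hand key lands on the corresponding syllable, a step this sketch omits.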
This paper presents our methodology and the final results, illustrated by accompanying videos. We recorded and analyzed the 3D trajectories of 50 hand and 63 facial fleshpoints during the production of 238 utterances carefully designed to cover all possible diphones of French. Linear and non-linear statistical models of hand and face deformations and postures were developed using both separate and joint corpora. Additional data allowed us to capture the shape of the hand and face at a higher spatial density (2,600 points for the hand and forearm, 2,000 for the face), as well as their appearance. We then built new high-density articulated models compatible with the previously derived set of control parameters, so that the synthesized parameters can drive the more realistic 3D models instead of the low-density ones.
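A linear statistical model of fleshpoint configurations of the kind described above can be sketched with a principal component analysis via SVD. This is only an illustration under assumed dimensions (63 facial points in 3D, one frame per utterance, random stand-in data); the paper's actual corpora, component counts, and non-linear models are not reproduced here.

```python
import numpy as np

# Illustrative linear model (PCA) of fleshpoint configurations.
# The corpus below is random stand-in data, not the paper's recordings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(238, 63 * 3))  # e.g. 63 facial fleshpoints in 3D

mean = frames.mean(axis=0)
u, s, vt = np.linalg.svd(frames - mean, full_matrices=False)

k = 6  # retain a few leading components as low-dimensional control parameters
controls = u[:, :k] * s[:k]              # per-frame control parameters
reconstructed = mean + controls @ vt[:k]  # approximate fleshpoint positions
```

The appeal of such a parameterization is that the same small set of control parameters can drive either the sparse fleshpoint model or a denser articulated mesh fitted to it.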
Bibliographic reference. Elisei, Frédéric / Bailly, Gérard / Gibert, Guillaume / Brun, Remi (2005): "Capturing data and realistic 3d models for cued speech analysis and audiovisual synthesis", In AVSP-2005, 125-130.