September 22-25, 1997
We have developed a visual speech synthesizer driven by unrestricted French text, synchronized to an audio text-to-speech synthesizer also developed at the ICP (Le Goff & Benoit, 1996). The front end of our synthesizer is a 3-D model of the face whose speech gestures are controlled by eight parameters: five for the lips, one for the chin, and two for the tongue. In contrast to most existing systems, which are based on a limited set of prestored facial images, we have adopted the parametric approach to coarticulation first proposed by Cohen and Massaro (1993). We have thus implemented a coarticulation model based on spline-like functions, each defined by three coefficients, applied to every target in a library of 16 French visemes. However, unlike Cohen & Massaro (1993), we have adopted a data-driven approach to identify the many coefficients necessary to model coarticulation. To do so, we systematically analyzed an ad-hoc corpus uttered by a French male speaker. We then ran an intelligibility test to quantify the benefit of seeing the synthetic face (in addition to hearing the synthetic voice) under several conditions of background noise.
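The parametric coarticulation scheme described above can be illustrated with a minimal sketch. In the Cohen & Massaro (1993) formulation, each viseme target exerts a time-varying "dominance" over an articulatory parameter, and the parameter trajectory is the dominance-weighted average of all active targets. The negative-exponential dominance function below, with three coefficients per target (magnitude, rate, and exponent), is one plausible reading of the "spline-like functions defined by three coefficients" mentioned in the abstract; the exact functions and coefficient values fitted from the French corpus are not given here, so all names and numbers are illustrative assumptions.

```python
import math

def dominance(t, alpha, theta, c):
    """Dominance of one viseme target at time offset t from its center.

    Negative-exponential form in the spirit of Cohen & Massaro (1993):
    alpha = peak magnitude, theta = falloff rate, c = shape exponent.
    (Illustrative assumption; the paper's fitted functions may differ.)
    """
    return alpha * math.exp(-theta * abs(t) ** c)

def parameter_track(t, targets):
    """Blend viseme targets into one articulatory parameter value at time t.

    targets: list of (center_time, target_value, alpha, theta, c) tuples,
    e.g. lip-aperture targets for successive visemes.  The result is the
    dominance-weighted average of the target values.
    """
    num = den = 0.0
    for center, value, alpha, theta, c in targets:
        d = dominance(t - center, alpha, theta, c)
        num += d * value
        den += d
    return num / den if den else 0.0

# Two hypothetical lip-aperture targets 200 ms apart: an open viseme
# (value 1.0) followed by a closed one (value 0.0), equal coefficients.
targets = [(0.0, 1.0, 1.0, 5.0, 1.0),
           (0.2, 0.0, 1.0, 5.0, 1.0)]
```

Because the two targets here share the same coefficients, the trajectory passes exactly through 0.5 at the midpoint (t = 0.1 s); unequal coefficients would instead let one viseme dominate the transition, which is how such a model captures anticipatory and carryover coarticulation.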
Bibliographic reference. Le Goff, Bertrand (1997): "Automatic modeling of coarticulation in text-to-visual speech synthesis", in EUROSPEECH-1997, 1667-1670.