Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Modeling Visual Coarticulation in Synthetic Talking Heads Using a Lip Motion Unit Inventory With Concatenative Synthesis

Steve Minnis (1), Andrew Breen (2)

(1) BT Adastral Park, (2) School of Information Systems, UEA, UK

The shape of the lips and the synchronization of their movement with the speech signal appear to be among the most important factors in the acceptability of a synthetic persona, particularly as synthetic characters approach photo-realism. Most of us can neither lipread nor easily identify a sound from lip shape alone, yet we readily detect whether the lip movements of a synthetic talking head are acceptable, even when the viewer/listener is a considerable distance from the speaker. In addition, experiments have shown that visible synthetic speech augments audible synthetic speech, improving both intelligibility and recognition accuracy; this is particularly true in noisy conditions where the audio signal is degraded [1]. Synthesizing the right lip movements for talking heads is therefore an important task in achieving a high degree of naturalness, as well as for potential applications in which talking heads assist hearing-impaired users.

One of the major challenges in lip-motion synthesis, as in speech synthesis, is the modeling of coarticulation. Coarticulation is the influence exerted on the articulation of a speech segment by the preceding segments (backward, or retentive, coarticulation) and by the following segments (forward, or anticipatory, coarticulation). Coarticulatory effects have been shown to extend up to six segments away [2].

Various techniques have been used to model visual coarticulation, all of which make assumptions about the extent of forward and backward influences and about the way in which these influences are modeled, ranging from simple additive influences to complex mathematical models. These models are usually physiologically grounded; for example, the speed at which the muscles controlling mouth shape can react may be one important factor. However, rule-based models are by their very nature complex, since the physiology of the visible articulatory musculature is itself complex.
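
As a concrete illustration of this family of models, the sketch below blends per-segment lip-shape targets using exponential dominance functions, in the spirit of the Cohen-Massaro model [1]. The segment timings, targets and time constants are invented for the example and are not taken from any of the systems discussed here.

    # A minimal sketch of dominance-function blending for visual
    # coarticulation, in the spirit of the Cohen-Massaro model [1].
    # All numeric values below are invented for illustration.
    import math

    def dominance(t, centre, alpha, theta, c=1.0):
        # Influence of a segment's lip-shape target at time t (seconds),
        # decaying exponentially with distance from the segment centre.
        return alpha * math.exp(-theta * abs(t - centre) ** c)

    def blend(t, segments):
        # Realized lip parameter: dominance-weighted average of all targets.
        weights = [dominance(t, s["centre"], s["alpha"], s["theta"]) for s in segments]
        targets = [s["target"] for s in segments]
        return sum(w * x for w, x in zip(weights, targets)) / sum(weights)

    # Hypothetical lip-opening targets for the three segments of "mum".
    segments = [
        {"centre": 0.05, "target": 0.0, "alpha": 1.0, "theta": 8.0},  # /m/: closed
        {"centre": 0.15, "target": 0.8, "alpha": 0.9, "theta": 6.0},  # vowel: open
        {"centre": 0.25, "target": 0.0, "alpha": 1.0, "theta": 8.0},  # /m/: closed
    ]
    trajectory = [blend(0.01 * i, segments) for i in range(31)]  # 10 ms frames

In such models, the assumptions about forward and backward influence are encoded in the shape and spread of the dominance functions.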

Rather than explicitly modeling this facial physiology, we present a data-driven method in which the dynamics of the facial musculature are captured in synchronization with the acoustic data. This approach improves on other data-driven techniques [3,4] in that it allows us to model visual coarticulatory effects as an extension of a concatenative speech synthesis unit selection process. Concatenative synthesis relies on extracting N-phone units of speech in appropriate contexts (thereby capturing coarticulatory effects), which are then concatenated and deformed according to linguistic criteria, for example when stress, or the pitch and duration changes required for intonation, must be imposed. Our hypothesis is that these linguistic criteria are applicable to visual lip synthesis in a similar way. This paper investigates how the visual unit selection process is realized.
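
To make the selection step concrete, the sketch below shows one way a search over a lip-motion unit inventory can be organized: a target cost scores how well a candidate unit matches the linguistic specification, a join cost penalizes visual discontinuities at concatenation points, and a dynamic-programming search picks the cheapest sequence. The cost terms, feature names and data structures are illustrative assumptions rather than the criteria actually used here.

    # A minimal sketch of Viterbi-style selection over a lip-motion unit
    # inventory. Features, costs and data layout are illustrative assumptions.

    def target_cost(unit, spec):
        # Mismatch between a candidate unit and the target specification
        # (phone context, duration, stress).
        cost = 0.0 if unit["phones"] == spec["phones"] else 5.0
        cost += abs(unit["duration"] - spec["duration"])
        cost += 0.0 if unit["stressed"] == spec["stressed"] else 1.0
        return cost

    def join_cost(prev_unit, unit):
        # Visual discontinuity at the join: distance between the boundary
        # lip-shape parameter vectors of the two units.
        a, b = prev_unit["last_frame"], unit["first_frame"]
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def select_units(targets, inventory):
        # Dynamic-programming search for the cheapest unit sequence; each
        # target is assumed to have at least one candidate in the inventory.
        lattice = []  # lattice[i][j] = (cumulative cost, backpointer, unit)
        for i, spec in enumerate(targets):
            column = []
            for unit in inventory[spec["phones"]]:
                tc = target_cost(unit, spec)
                if i == 0:
                    column.append((tc, None, unit))
                else:
                    cost, back = min(
                        (prev_cost + join_cost(prev_unit, unit), k)
                        for k, (prev_cost, _, prev_unit) in enumerate(lattice[-1])
                    )
                    column.append((cost + tc, back, unit))
            lattice.append(column)
        # Trace back the lowest-cost path through the lattice.
        path, idx = [], min(range(len(lattice[-1])), key=lambda j: lattice[-1][j][0])
        for column in reversed(lattice):
            _, back, unit = column[idx]
            path.append(unit)
            idx = back if back is not None else 0
        return list(reversed(path))

A richer target cost could also include the wider phonetic context, pitch and prominence features, which is where linguistic criteria shared with audio unit selection would enter.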

References

  1. Cohen, M. M. and Massaro, D. W., "Modeling Coarticulation in Synthetic Visual Speech", in Thalmann, N. M. and Thalmann, D. (Eds.), Models and Techniques in Computer Animation, pp. 131-156, Springer, Tokyo, 1993.
  2. Kent, R. D. and Minifie, F. D., "Coarticulation in Recent Speech Production Models", Journal of Phonetics, 5, 115-133, 1977.
  3. Breen, A. P., Bowers, E. and Welsh, W., "An Investigation into the Generation of Mouth Shapes for a Talking Head", Proc. ICSLP '96, October 1996.
  4. Breen, A. P., Gloaguen, O. and Stern, P., "A Fast Method of Producing Talking Head Mouth Shapes from Real Speech", Proc. ICSLP '98, November 1998.



Bibliographic reference.  Minnis, Steve / Breen, Andrew (2000): "Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis", In ICSLP-2000, vol.2, 759-762.