Sixth International Conference on Spoken Language Processing
One of the major challenges in speech synthesis, as well as in lip-motion synthesis, is the modelling of coarticulation. Coarticulation is the influence on the articulation of a speech segment of the preceding segments (backward/retentive coarticulation) and the following segments (forward/anticipatory coarticulation). Coarticulation effects have been shown to affect speech sounds up to 6 segments away.
Various techniques have been used to model visual coarticulation, all of which make assumptions about the degree of forward and backward influences and the way in which these are modelled, from simple additive influences to complex mathematical models. Usually these models are physiologically grounded; for example, the speed at which the muscles that shape the mouth can react may be one important factor. However, rule-based models are by their very nature complex, since the physiology of the visible articulation musculature is also complex.
Rather than explicitly modelling this facial physiology, we present a data-driven method in which the dynamics of the facial musculature are captured in synchronization with the acoustic data. This approach is an improvement on other data-driven techniques [3,4], as it allows us to model visual coarticulatory effects as an extension of the unit selection process in concatenative speech synthesis. Concatenative synthesis relies on the ability to extract appropriate contextual N-phone units of speech (the context capturing coarticulatory effects), which are then concatenated and deformed based on linguistic criteria, for example when stress or appropriate pitch and duration changes are required for intonation. Our hypothesis is that these linguistic criteria are applicable to visual lip synthesis in a similar way. This paper investigates how the visual unit selection process is realized.
Video [AVI; 1.6 MB]
Bibliographic reference. Minnis, Steve / Breen, Andrew (2000): "Modeling visual coarticulation in synthetic talking heads using a lip motion unit inventory with concatenative synthesis", In ICSLP-2000, vol.2, 759-762.