INTERSPEECH 2010
11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Audio-Visual Anticipatory Coarticulation Modeling by Human and Machine

Louis H. Terry (1), Karen Livescu (2), Janet B. Pierrehumbert (1), Aggelos K. Katsaggelos (1)

(1) Northwestern University, USA
(2) TTIC, USA

Anticipatory coarticulation provides a basis for the observed asynchrony between the acoustic and visual onsets of phones in certain linguistic contexts, yet it is typically not modeled explicitly in audio-visual (AV) speech models. We study within-word audio-visual asynchrony using hand-labeled words in which theory suggests that asynchrony should occur, and show that these labels confirm the theory. We introduce a new statistical model of AV speech, the asynchrony-dependent transition (ADT) model, which allows asynchrony between AV states within word boundaries and in which the state transitions depend on the instantaneous asynchrony as well as on the modality's state. This model outperforms a baseline synchronous model in mimicking the hand labels in a forced alignment task, and its behavior as parameters are changed conforms to our expectations about anticipatory coarticulation. The same model could be used for automatic speech recognition (ASR), although here we consider it for the task of forced alignment for linguistic analysis.

Bibliographic reference. Terry, Louis H. / Livescu, Karen / Pierrehumbert, Janet B. / Katsaggelos, Aggelos K. (2010): "Audio-visual anticipatory coarticulation modeling by human and machine", in INTERSPEECH-2010, 2682-2685.