Auditory-Visual Speech Processing 2005

British Columbia, Canada
July 24-27, 2005

Cognitive Processing of Audiovisual Cues to Prominence

Marc Swerts, Emiel Krahmer

Tilburg University, The Netherlands

Speakers use both auditory markers (e.g., pitch accents, increased syllable durations) and visual markers (e.g., head nods and eyebrow movements) to indicate important words in an utterance. Auditory markers have stronger cue value for the observer than visual ones, but visual markers also have a strong impact [1, 2]. Prominence judgement tasks with incongruent stimuli (utterances in which auditory and visual prominence markers are associated with different words) reveal that such stimuli increase confusion among perceivers [3], and that the incongruencies are disliked, presumably because they are unnatural [4].

This paper addresses two related questions regarding prominence perception:

In a production experiment, native speakers of Dutch produced a Dutch sentence in a number of conditions, each time with emphasis on a different word. A selection from these AV recordings was used for two perception experiments. In the first, the AV recordings were manipulated such that auditory and visual accents were either congruent (on the same word) or incongruent (on different words). Results from a speeded prominence judgement task reveal that incongruent stimuli are processed more slowly than congruent stimuli, but only when participants perceived the auditorily accented word as most prominent. Thus subjects are sensitive to visual cues to prominence, even when they do not use this information in their actual choice. In the second perception experiment, subjects were presented with materials from the production experiment in which sound and video had been manipulated to create stimuli with monotonous pitch but with a visual accent on the first, second, or third noun phrase. In addition, subjects saw either the entire face or only its upper half, lower half, right half, or left half. Results show that the upper facial area has stronger cue value for prominence detection than the bottom part, and that the left part of the face is more important than the right part. We are currently exploring to what extent the results are due to localisation of speaker expressiveness [e.g., 5] or to observer attentional effects [e.g., 6].

