Auditory-Visual Speech Processing (AVSP) 2013

Annecy, France
August 29 - September 1, 2013

Integration of Acoustic and Visual Cues in Prominence Perception

Hansjörg Mixdorff (1), Angelika Hönemann (1), Sascha Fagel (2)

(1) Department of Computer Science and Media, Beuth University Berlin, Germany
(2) Zoobe Message Entertainment GmbH, Berlin, Germany

This study concerns the perception of prominence in auditory-visual speech. We constructed A/V stimuli from five-syllable sentences in which every syllable was a candidate for receiving stress. All syllables were of uniform length, and the F0 contours were manipulated using the Fujisaki model, moving an F0 peak from the beginning to the end of the utterance. The peak was aligned either with the center of a syllable or with the boundary between two syllables, yielding a total of nine positions. Likewise, a video showing the upper part of a speaker’s face with a single eyebrow raise was aligned with the audio, with the maximum displacement of the eyebrows coinciding with syllable centers or boundaries, yielding nine positions for the visual cue as well. Another series of stimuli was produced with head nods as the visual cue. In addition, stimuli with constant F0, with or without video, were created. Twenty-two native German subjects rated the strength of each of the five syllables in a stimulus on a scale from 1 to 3. Results show that acoustic prominence outweighs visual prominence, and that the integration of both in a single syllable is strongest when the movement as well as the F0 peak are aligned with the center of the syllable. However, F0 peaks aligned with the right boundary of the accented syllable, as well as visual peaks aligned with the left boundary, also boost prominence considerably. Nods had an effect similar in magnitude to that of eyebrow movements; however, results suggest that they need to be aligned with the right boundary of the syllable rather than the left one.

Index Terms: Prominence, auditory-visual integration, F0 modeling
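
As an illustration of the alignment grid described in the abstract, the following minimal sketch enumerates the nine cue positions that result from five syllables of uniform length: five syllable centers plus four boundaries between adjacent syllables. The function name and the syllable duration of 0.2 s are assumptions made for this example only; the abstract does not state the actual duration.

```python
def alignment_positions(n_syllables=5, syllable_dur=0.2):
    """Return (time_in_seconds, label) pairs for every syllable center
    and every boundary between two adjacent syllables.

    syllable_dur is a hypothetical value; the abstract only says that
    all syllables were of uniform length.
    """
    positions = []
    for i in range(n_syllables):
        # Center of syllable i+1 lies half a syllable after its onset.
        positions.append(((i + 0.5) * syllable_dur, f"center of syllable {i + 1}"))
        # A boundary exists only between two syllables, not after the last one.
        if i < n_syllables - 1:
            positions.append(((i + 1) * syllable_dur, f"boundary {i + 1}/{i + 2}"))
    return positions


positions = alignment_positions()
for time, label in positions:
    print(f"{time:.2f} s  {label}")
```

With five syllables the list has nine entries, matching the nine positions stated for the F0 peak and for the visual cue.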

