Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
Visual cues to speech prosody are available from a speaker's face; however, the form and location of such cues are likely to be inconsistent across speakers. Given this, the question arises whether such cues have enough in common to signal the same prosodic information across face areas and different speakers. To investigate this, the present study used visual-visual matching tasks requiring participants to view pairs of silent videos (one video displaying the upper half of the face, the other the lower half), and select the pair produced with the same prosody (different recorded tokens were used). Participants completed both a same-speaker version of the task (upper and lower videos from the same speaker) and a cross-speaker version (upper and lower videos from different speakers). Compared to same-speaker matching, performance was lower for cross-speaker matching but still well above chance (i.e., 50%). These results support the idea that visual correlates of prosody are encoded by perceivers as abstract, non-speaker-specific cues that are transferable across repetitions, speakers and face areas.
Index Terms: visual prosody, perception, cross-speaker, same-speaker, face area, prosodic focus, prosodic phrasing.
Bibliographic reference. Cvejic, Erin / Kim, Jeesun / Davis, Chris (2010): "Abstracting visual prosody across speakers and face areas", In AVSP-2010, paper S3-1.