ISCA Archive AVSP 2013

Temporal integration for live conversational speech

Ragnhild Eg, Dawn M. Behne

The difficulty of detecting short asynchronies between corresponding audio and video signals demonstrates the remarkable resilience of the perceptual system when integrating the senses. Thresholds for perceived synchrony vary with the complexity, congruency and predictability of the audiovisual event. For instance, asynchrony is typically detected sooner for simple flash and tone combinations than for speech stimuli. In applied scenarios, such as teleconference platforms, the thresholds themselves are of particular interest: since the transmission of audio and video streams can result in temporal misalignments, system providers need to establish how much delay they can allow. This study compares the perception of synchrony in speech for a live two-way teleconference scenario and a controlled experimental set-up. Although methodologies and measures differ, our exploratory analysis indicates that the windows of temporal integration are similar for the two scenarios. Nevertheless, the direction of temporal tolerance differs: for the teleconference, audio lead asynchrony was more difficult to detect than for the experimental speech videos. While the windows of temporal integration are fairly independent of the context, the skew in the audio lead threshold may reflect the natural diversion of attending to a conversation.

Index Terms: audiovisual speech, temporal integration, synchrony perception, teleconference

Cite as: Eg, R., Behne, D.M. (2013) Temporal integration for live conversational speech. Proc. Auditory-Visual Speech Processing, 129-134

@inproceedings{eg13_avsp,
  author={Ragnhild Eg and Dawn M. Behne},
  title={{Temporal integration for live conversational speech}},
  year=2013,
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={129--134}
}