AVSP 2003 - International Conference on Audio-Visual Speech Processing

September 4-7, 2003
St. Jorioz, France

Exploring the Spatial Frequency Requirements of Audio-Visual Speech Using Superimposed Facial Motion

Douglas M. Shiller (1), Christian Kroos (2), Eric Vatikiotis-Bateson (3), K. G. Munhall (4)

(1) Queen's University, Canada - (2) Munich University, Germany - (3) University of British Columbia, Canada - (4) ATR Human Information Science Laboratories, Japan

While visually complex stimuli such as human faces contain information across a wide range of spatial frequencies, information related to specific perceptual judgements may be concentrated in distinct spatial frequency bands. For example, previous work on static face perception has shown that face recognition relies primarily on low spatial frequency information, while other tasks, such as identifying facial expressions, may require higher spatial frequencies. An innovative approach to identifying such spatial frequency biases has been the use of hybrid visual stimuli: stimuli that superimpose two distinct images, one of which has been spatially filtered to remove high spatial-frequency information (i.e., low-pass filtered) and the other filtered to remove low spatial-frequency information (i.e., high-pass filtered) (Schyns and Oliva, 1999). By placing these two spatial-frequency portions of the image in direct competition with each other, hybrid stimuli allow for the identification of spatial frequency bands that are preferentially processed by the visual system, rather than merely sufficient for the task.
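The construction of such a hybrid can be sketched in a few lines. The sketch below uses a Gaussian blur as the low-pass filter and its residual as the complementary high-pass filter; the filter choice and the `sigma` value are illustrative assumptions, not the parameters used by Schyns and Oliva (1999) or in the present study:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_hybrid(img_a, img_b, sigma=4.0):
    """Superimpose the low spatial frequencies of img_a on the high
    spatial frequencies of img_b (illustrative Gaussian filtering)."""
    a = img_a.astype(float)
    b = img_b.astype(float)
    low = gaussian_filter(a, sigma)        # coarse structure of image A
    high = b - gaussian_filter(b, sigma)   # fine detail of image B
    return low + high
```

Because the low- and high-pass components are complementary at the same cutoff, hybridizing an image with itself simply reconstructs it; with two different faces, each frequency band carries a different identity.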

In this paper, we have used a similar technique to explore the range of spatial frequencies involved in the processing of audio-visual speech. We produced dynamic hybrid stimuli in which two video sequences of a talker producing different VCV utterances (such as 'aba' and 'aga') were spatially low- and high-pass filtered at a number of cutoff frequencies (ranging from 2.75 to 44 cycles/face) and then combined. The resulting hybrids were presented to subjects with a single audio signal that was congruent with one of the two utterances. Thus, subjects were presented with two visual alternatives for audio-visual integration, one of which would produce the McGurk effect. In a separate condition, subjects were presented with individual low-pass or high-pass filtered video sequences on their own (i.e., half of the hybrid stimuli) with either congruent or incongruent audio.
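Filtering at a cutoff specified in cycles/face can be sketched with a frequency-domain mask. The version below assumes a square image whose width spans one face, so cycles per image stands in for cycles/face, and uses an ideal (sharp) cutoff for clarity; the actual filters applied to the video frames in the study are not specified here:

```python
import numpy as np

def fft_filter(img, cutoff, mode="low"):
    """Ideal low- or high-pass filter of a square image.
    cutoff is in cycles per image width (a proxy for cycles/face);
    the sharp cutoff is an illustrative assumption."""
    n = img.shape[0]
    f = np.fft.fftshift(np.fft.fft2(img.astype(float)))
    yy, xx = np.mgrid[:n, :n]
    # radial spatial frequency of each FFT coefficient, in cycles/image
    r = np.hypot(yy - n // 2, xx - n // 2)
    keep = r <= cutoff if mode == "low" else r > cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * keep)))
```

Since the two masks partition the spectrum at the cutoff, the low- and high-pass outputs of the same frame sum back to the original, which is what lets the two halves of a hybrid be assigned to different utterances without losing information.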

Results suggest that visual information sufficient for speech perception is represented across a broad range of spatial frequencies. The McGurk effect was observed for nearly all of the individual low- and high-pass filtered stimuli. The hybrid stimuli, however, revealed a clear low spatial frequency bias in the processing of visual speech information. These results will be discussed in the context of the information requirements of face-to-face communication.

Bibliographic reference.  Shiller, Douglas M. / Kroos, Christian / Vatikiotis-Bateson, Eric / Munhall, K. G. (2003): "Exploring the spatial frequency requirements of audio-visual speech using superimposed facial motion", Abstract, In AVSP 2003, 257.