Auditory-Visual Speech Processing (AVSP) 2011

Volterra, Italy
September 1-2, 2011

Audiovisual Streaming in Voicing Perception: New Evidence for a Low-Level Interaction between Audio and Visual Modalities

Frédéric Berthommier, Jean-Luc Schwartz

GIPSA-Lab, UMR CNRS 5216 – Grenoble University, France

Audio-visual (AV) interaction in speech has mostly been considered for its redundancy and complementarity at the phonetic level, but a few experiments have shown that it also plays a significant role in early auditory analysis. We propose a new paradigm that uses the pre-voicing component (PVC) excised from a true /b/. When this so-called target PVC is added to a /p/, the result is clearly perceived as /b/. Moreover, varying the amplitude of the target PVC builds a perceptual continuum from /p/, when the amplitude is set to 0, to /b/, at the original amplitude. In the audio channel, adding a series of PVCs at a fixed low amplitude before and after the target creates a stream of regular sounds that are not related to visible events. In contrast, the bilabial opening of the /p/ is a specific speech gesture that is visible in the video channel. The target PVC and the visible gesture are likewise not redundant events. Depending on its intensity level, the target PVC added to an audio /p/ can then either be embedded in the stream of other PVCs or be phonetically fused with the /p/, so that /b/ is perceived. To study the competition between these two alternatives and the role of the AV interaction, we use a 2×2 factorial design that contrasts Clear/Stream and Audio/AV conditions while controlling the amplitude of the target PVC. The “Clear” condition contains no stream of PVCs and provides the baseline. The streaming effect by itself is significant in the “Audio” condition, but the novel finding is a strong AV interaction. When a stream of PVCs is present, the rate of perceived /p/ is higher in the “AV” condition than in the “Audio” condition, suggesting that the visible lip-opening gesture increases the tendency to isolate the formant trajectory towards the vowel from the PVC, hence increasing the perception of unvoiced stimuli.
We conclude that low-level audio streaming is reinforced when the visual information is not redundant and that, in this case, the visual information works against the phonetic fusion of the voicing cue.
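
The stimulus construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' procedure: the function name `build_stimulus`, the stream gain, the number of stream PVCs, the onset spacing, and the placeholder waveforms are all assumptions, since the abstract gives no numerical values. It shows how scaling the target PVC yields a continuum, and how the "Stream" condition could surround the target with low-amplitude PVCs.

```python
import numpy as np


def build_stimulus(p_syllable, pvc, target_gain, stream=False,
                   stream_gain=0.2, n_before=3, n_after=3, period=None):
    """Mix a scaled target PVC with a /p/ syllable, optionally inside a PVC stream.

    All defaults (stream_gain, n_before, n_after, period) are hypothetical
    values chosen for illustration; they do not come from the paper.
    """
    p_syllable = np.asarray(p_syllable, dtype=float)
    pvc = np.asarray(pvc, dtype=float)
    if period is None:
        period = 2 * len(pvc)  # assumed onset-to-onset spacing of the stream
    if period < len(pvc):
        raise ValueError("period must be at least the PVC length")

    before = n_before if stream else 0
    after = n_after if stream else 0
    target_onset = before * period
    syllable_onset = target_onset + len(pvc)  # target PVC ends where /p/ begins

    # Regular stream onsets before and after the target (k == 0 is the target).
    stream_onsets = [target_onset + k * period
                     for k in range(-before, after + 1) if k != 0]
    ends = [syllable_onset + len(p_syllable)]
    ends += [onset + len(pvc) for onset in stream_onsets]
    out = np.zeros(max(ends))

    for onset in stream_onsets:
        out[onset:onset + len(pvc)] += stream_gain * pvc
    out[target_onset:target_onset + len(pvc)] += target_gain * pvc
    out[syllable_onset:syllable_onset + len(p_syllable)] += p_syllable
    return out


# Placeholder waveforms standing in for the recorded PVC and /p/ syllable.
fs = 16000                                    # assumed sampling rate (Hz)
pvc_len, syl_len = 1280, 4800                 # assumed durations in samples
pvc = np.sin(2 * np.pi * 120 * np.arange(pvc_len) / fs)
p_syllable = 0.1 * np.random.default_rng(0).standard_normal(syl_len)

# Target gains from 0 (original /p/) to 1 (original PVC amplitude).
gains = np.linspace(0.0, 1.0, 5)
clear = [build_stimulus(p_syllable, pvc, g) for g in gains]
streamed = [build_stimulus(p_syllable, pvc, g, stream=True) for g in gains]
```

Each list holds one audio track per step of the continuum. Pairing these tracks with or without the video of the lip-opening gesture would then give the four cells of the Clear/Stream by Audio/AV design.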

Index Terms. auditory streaming, voicing perception, multisensory fusion, AV speech perception, scene analysis

Bibliographic reference.  Berthommier, Frédéric / Schwartz, Jean-Luc (2011): "Audiovisual streaming in voicing perception: new evidence for a low-level interaction between audio and visual modalities", In AVSP-2011, 77-80.