International Conference on Auditory-Visual Speech Processing 2008

Tangalooma Wild Dolphin Resort, Moreton Island, Queensland, Australia
September 26-29, 2008

Effect of Audio-Visual Asynchrony between Time-Expanded Speech and a Moving Image of a Talker's Face on Detection and Tolerance Thresholds

Shuichi Sakamoto (1), Akihiro Tanaka (2), Shun Numahata (1), Atsushi Imai (3), Tohru Takagi (3), Yôiti Suzuki (1)

(1) Research Institute of Electrical Communication and Graduate School of Information Sciences, Tohoku University, Sendai, Japan
(2) Graduate School of Humanities and Sociology, The University of Tokyo, Tokyo, Japan
(3) NHK Science and Technical Research Laboratories, Tokyo, Japan

In this study, we measured detection and tolerance thresholds of auditory-visual asynchrony between time-expanded speech and a moving image of the talker's face. During the experiments, words were presented under two conditions: asynchrony produced by time-expanded speech (expansion condition: EXP) and asynchrony produced by a simple timing shift (asynchronous condition: ASYN). We used 16 shorter Japanese words (four morae) and 20 longer Japanese words (seven or eight morae). All auditory speech was presented in pink noise to avoid a ceiling effect. The SNRs for the shorter and longer words were set to -10 dB and -3.5 dB, respectively. For EXP, the auditory speech signals were analyzed and resynthesized using STRAIGHT to change the words' duration (Kawahara et al., 1998). The resynthesized auditory signals were combined with the visual signals so that the onset of the utterance was synchronous. For ASYN, the auditory speech signal was simply delayed relative to the visual speech signal. Results showed that detection and tolerance thresholds for longer words were higher than those for shorter words. However, when the thresholds were recalculated as a function of the ratio of the expansion rate to word duration, these differences were not observed. These results suggest that detection and tolerance thresholds for auditory-visual asynchrony between time-expanded speech and a moving image of a talker's face might depend on the ratio of the expansion rate to word duration.
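
As an illustration of how speech can be embedded in noise at a fixed SNR such as the -10 dB and -3.5 dB values above, the following is a minimal sketch and not the authors' procedure. It assumes that signal level is measured as unweighted mean-square power over the whole word, which the abstract does not state. The function names (mean_power, noise_gain_for_snr, mix_at_snr) are hypothetical, and the noise samples are arbitrary stand-ins for pink noise.

```python
import math


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to `noise` so that the speech-to-noise power ratio is snr_db.

    Assumption: level is unweighted mean-square power over the full signal.
    """
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    return math.sqrt(target_noise_power / mean_power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Add gain-scaled noise to speech, sample by sample."""
    gain = noise_gain_for_snr(speech, noise, snr_db)
    return [s + gain * n for s, n in zip(speech, noise)]


def achieved_snr_db(speech, scaled_noise):
    """SNR in dB between a speech signal and an already-scaled noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(scaled_noise))


# Stand-in samples only; real stimuli would be recorded speech and pink noise.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, -0.5, -0.5, 0.5]

gain = noise_gain_for_snr(speech, noise, snr_db=-10.0)
scaled_noise = [gain * n for n in noise]
mixture = mix_at_snr(speech, noise, snr_db=-10.0)
```

A negative SNR means the scaled noise carries more power than the speech, which is consistent with the stated aim of keeping performance away from ceiling.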

