10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models

Zhanyu Ma, Arne Leijon

KTH, Sweden

Ten normal-hearing listeners took consonant recognition tests on audio-visual (A-V) recordings at different signal-to-noise ratios (SNRs). The A-V recognition results were predicted with the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) to the recognition task. As expected, all the models agree qualitatively with the observed results: the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the A-V integration result, while the POSTL model underestimates it. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.
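For readers unfamiliar with the FLMP, the sketch below illustrates the multiplicative integration rule as it is commonly stated in the literature: the bimodal response probability for each class is the product of the audio-only and visual-only support values, normalized over all classes. The function name `flmp_combine` and the numbers are hypothetical and chosen only for illustration; the abstract does not state the exact formulation or data used in the paper.

```python
def flmp_combine(audio_support, visual_support):
    """Combine per-class unimodal support values with the FLMP product rule.

    Assumption: this is the commonly cited FLMP form,
    P(r | A, V) = a_r * v_r / sum_k(a_k * v_k),
    not necessarily the exact variant used in the paper.
    """
    if len(audio_support) != len(visual_support):
        raise ValueError("audio and visual inputs must cover the same classes")
    # Multiply the two unimodal support values class by class.
    products = [a * v for a, v in zip(audio_support, visual_support)]
    total = sum(products)
    if total == 0:
        raise ValueError("at least one class needs non-zero support in both modalities")
    # Normalize so the predicted bimodal probabilities sum to one.
    return [p / total for p in products]


# Hypothetical response probabilities for three consonant classes.
audio = [0.5, 0.3, 0.2]   # audio-only condition (illustrative values)
visual = [0.6, 0.1, 0.3]  # visual-only condition (illustrative values)
bimodal = flmp_combine(audio, visual)
print(bimodal)
```

With these illustrative inputs, the predicted bimodal probability of the first class is higher than in either unimodal condition. The product rule sharpens the response distribution whenever both modalities favor the same class, which is one plausible reason a model of this kind can overestimate human A-V performance.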

Bibliographic reference. Ma, Zhanyu / Leijon, Arne (2009): "Human audio-visual consonant recognition analyzed with three bimodal integration models", In INTERSPEECH-2009, 812-815.