Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Stream Weight Optimization of Speech and Lip Image Sequence for Audio-Visual Speech Recognition

Satoshi Nakamura (1), Hidetoshi Ito (2), Kiyohiro Shikano (2)

(1) ATR Spoken Language Translation Research Laboratories, Seika-cho, Soraku-gun, Kyoto, Japan
(2) Graduate School of Information Science, Nara Institute of Science and Technology, Japan

Bimodal speech recognition systems, which use visual information to supplement acoustic information, have been shown to yield better recognition performance than purely acoustic systems, especially when background noise is present. The early integration strategy for HMM-based audio-visual speech recognition is one promising approach, in which the output probability is obtained as the product of the output probabilities of the audio and visual streams. This paper proposes a novel method that optimizes the stream weights so as to maximize recognition performance. The proposed method estimates the stream weights based on a normalized log likelihood, which is derived from the ratio of the likelihood of the correct word to the highest likelihood among the incorrect words. Isolated word recognition experiments show that audio-visual speech recognition with the proposed method attains 56.2% (10 dB), 55.2% (0 dB) and 15.2% (20 dB) better performance than recognition using audio information only. The results also show that the proposed method can reduce the number of adaptation words.
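The stream-weighted product and the normalized log likelihood described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the convention that the audio and visual weights sum to one, the use of utterance-level word scores instead of per-frame HMM output probabilities, and the simple grid search, which stands in for the paper's actual estimation procedure.

```python
def combine(log_audio, log_visual, audio_weight):
    """Weighted product of stream likelihoods, computed in the log domain.

    Assumption: the visual weight is 1 - audio_weight.
    """
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


def normalized_log_likelihood(audio_scores, visual_scores, correct, audio_weight):
    """Log of the ratio between the correct word's likelihood and the
    highest likelihood among the incorrect words.

    audio_scores / visual_scores map each vocabulary word to a
    log likelihood (hypothetical utterance-level scores).
    """
    combined = {
        word: combine(audio_scores[word], visual_scores[word], audio_weight)
        for word in audio_scores
    }
    best_incorrect = max(s for word, s in combined.items() if word != correct)
    return combined[correct] - best_incorrect


def estimate_audio_weight(adaptation_data, steps=10):
    """Pick the audio weight on a uniform grid that maximizes the summed
    normalized log likelihood over the adaptation words.

    adaptation_data is a list of (correct_word, audio_scores, visual_scores).
    The grid search is a stand-in, not the paper's estimator.
    """
    candidates = [i / steps for i in range(steps + 1)]

    def total(weight):
        return sum(
            normalized_log_likelihood(audio, visual, correct, weight)
            for correct, audio, visual in adaptation_data
        )

    return max(candidates, key=total)
```

With a single adaptation word whose audio scores favor the wrong word (as might happen in heavy noise) and whose visual scores favor the correct one, this sketch shifts all of the weight to the visual stream; real systems would apply the weights inside the HMM state output probabilities rather than to whole-utterance scores.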


Bibliographic reference. Nakamura, Satoshi / Ito, Hidetoshi / Shikano, Kiyohiro (2000): "Stream weight optimization of speech and lip image sequence for audio-visual speech recognition", in ICSLP-2000, vol.3, 20-24.