ISCA Archive ICSLP 2000

Stream weight optimization of speech and lip image sequence for audio-visual speech recognition

Satoshi Nakamura, Hidetoshi Ito, Kiyohiro Shikano

Bimodal speech recognition systems, which use visual information to supplement acoustic information, have been shown to yield better recognition performance than purely acoustic systems, especially when background noise is present. The early integration strategy for HMM-based audio-visual speech recognition is one promising approach, in which the output probability is obtained as the product of the output probabilities of the audio and visual streams. This paper presents a novel method that optimizes the stream weights so as to maximize recognition performance. The proposed method estimates the stream weights based on a normalized log likelihood, which is derived from the ratio of the likelihood of the correct word to the highest likelihood among the incorrect words. Isolated word recognition experiments show that audio-visual speech recognition with the proposed method attains 56.2% (10 dB), 55.2% (0 dB) and 15.2% (20 dB) better performance than recognition using audio information alone. The results also show that the proposed method can reduce the number of adaptation words required.
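
The two quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the estimation procedure, so the grid search, the constraint that the two stream weights sum to 1, and all function and variable names here are assumptions made for illustration. Per-word log likelihoods from each stream are taken as given, as arrays of shape (utterances, vocabulary words).

```python
import numpy as np

def combine(audio_ll, visual_ll, audio_weight):
    """Weighted sum of log likelihoods, i.e. a weighted product of
    stream likelihoods. Assumes the two weights sum to 1."""
    return audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll

def normalized_ll(scores, correct):
    """Log of the ratio between the correct word's likelihood and the
    highest likelihood among the incorrect words."""
    rivals = np.delete(scores, correct)
    return scores[correct] - rivals.max()

def pick_audio_weight(audio_ll, visual_ll, labels, grid=None):
    """Hypothetical grid search: choose the audio weight that maximizes
    the mean normalized log likelihood over the adaptation words."""
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)

    def objective(weight):
        combined = combine(audio_ll, visual_ll, weight)
        return np.mean([normalized_ll(row, y) for row, y in zip(combined, labels)])

    return max(grid, key=objective)

# Toy adaptation data: the audio stream is uninformative (as in heavy noise),
# while the visual stream separates the correct word from the others.
audio_ll = np.array([[-10.0, -10.0, -10.0], [-10.0, -10.0, -10.0]])
visual_ll = np.array([[-1.0, -5.0, -5.0], [-5.0, -1.0, -5.0]])
labels = [0, 1]
best = pick_audio_weight(audio_ll, visual_ll, labels)
```

On this toy data the search puts all the weight on the visual stream, since the audio scores cannot distinguish the words. A positive normalized log likelihood means the correct word outscores every competitor, so maximizing it pushes the weights toward the stream that is more reliable under the current noise condition.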


Cite as: Nakamura, S., Ito, H., Shikano, K. (2000) Stream weight optimization of speech and lip image sequence for audio-visual speech recognition. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 3, 20-24

@inproceedings{nakamura00_icslp,
  author={Satoshi Nakamura and Hidetoshi Ito and Kiyohiro Shikano},
  title={{Stream weight optimization of speech and lip image sequence for audio-visual speech recognition}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  pages={vol. 3, 20-24}
}