Sixth International Conference on Spoken Language Processing
October 16-20, 2000
Stream Weight Optimization of Speech and Lip Image Sequence for Audio-Visual Speech Recognition
Satoshi Nakamura (1), Hidetoshi Ito (2), Kiyohiro Shikano (2)
(1) ATR Spoken Language Translation Research Laboratories,
Seika-cho, Soraku-gun, Kyoto, Japan
Bimodal speech recognition systems, with the use of visual
information to supplement acoustic information, have
been shown to yield better recognition performance than
purely acoustic systems, especially when background noise
is present. The early integration strategy for HMM-based
audio-visual speech recognition is one promising approach,
in which the output probability is obtained as the product of the output
probabilities of the audio and visual streams. This paper
proposes a novel method that optimizes the stream weights
so as to maximize recognition performance. The proposed
method estimates the stream weights based on a normalized
log likelihood, which is derived from the ratio of the likelihood
of the correct word to the highest likelihood among the incorrect words.
The isolated word recognition experiments show that
audio-visual speech recognition with the proposed method
attains 56.2% (10 dB), 55.2% (0 dB) and 15.2% (20 dB)
better performance than recognition using audio information only.
The results also show that the proposed method can
reduce the number of adaptation words.
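
The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation, since the abstract gives no formulas. It assumes that the stream weights act as exponents on the per-stream likelihoods (a weighted sum in the log domain), that the two weights sum to one, and that a weight is chosen by a simple grid search over a few adaptation words. The function names, the toy scores and the selection rule (most adaptation words recognized correctly, ties broken by the largest worst-case margin) are all hypothetical, and any further normalization used in the paper is omitted.

```python
def combined_log_likelihood(log_audio, log_visual, audio_weight):
    """Weighted product of stream likelihoods, written in the log domain.

    Assumption: the visual weight is 1 - audio_weight.
    """
    return audio_weight * log_audio + (1.0 - audio_weight) * log_visual


def log_likelihood_margin(audio_scores, visual_scores, correct_word, audio_weight):
    """Log of the ratio: correct-word likelihood / best incorrect-word likelihood."""
    combined = {
        word: combined_log_likelihood(audio_scores[word], visual_scores[word], audio_weight)
        for word in audio_scores
    }
    best_incorrect = max(score for word, score in combined.items() if word != correct_word)
    return combined[correct_word] - best_incorrect


def pick_audio_weight(adaptation_words, candidates):
    """Grid search (an assumption, not the paper's estimator).

    Prefers weights that recognize the most adaptation words correctly
    (positive margin), then the largest worst-case margin.
    """
    def quality(audio_weight):
        margins = [
            log_likelihood_margin(audio, visual, correct, audio_weight)
            for audio, visual, correct in adaptation_words
        ]
        return (sum(m > 0 for m in margins), min(margins))

    return max(candidates, key=quality)


# Toy per-word log likelihoods (hypothetical values, not from the paper).
adaptation_words = [
    # Noisy audio slightly prefers the wrong word; the lip stream is right.
    ({"yes": -120.0, "no": -118.0}, {"yes": -40.0, "no": -55.0}, "yes"),
    # Audio is right; the lip stream slightly prefers the wrong word.
    ({"yes": -110.0, "no": -100.0}, {"yes": -48.0, "no": -50.0}, "no"),
]
candidates = [i / 10 for i in range(11)]  # audio weights 0.0, 0.1, ..., 1.0
best_audio_weight = pick_audio_weight(adaptation_words, candidates)
print(best_audio_weight)
```

With these toy scores, neither stream alone recognizes both words, while an intermediate weight does, which is the situation in which tuning the stream weights pays off.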
(2) Graduate School of Information Science, Nara Institute of Science and Technology, Japan
Nakamura, Satoshi / Ito, Hidetoshi / Shikano, Kiyohiro (2000):
"Stream weight optimization of speech and lip image sequence for audio-visual speech recognition",
In ICSLP-2000, vol.3, 20-24.