AVSP 2003 - International Conference on Audio-Visual Speech Processing

September 4-7, 2003
St. Jorioz, France

Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images

Tomoaki Yoshinaga, Satoshi Tamura, Koji Iwano, Sadaoki Furui

Department of Computer Science, Tokyo Institute of Technology, Japan

This paper proposes an audio-visual speech recognition method that uses lip movement extracted from side-face images to increase noise robustness in mobile environments. Although most previous bimodal speech recognition methods use frontal face (lip) images, they are inconvenient for users, who must hold a device with a camera in front of their face while talking. Our proposed method, which captures lip movement with a small camera installed in a handset, is more natural and convenient. It also effectively avoids a decrease in the signal-to-noise ratio (SNR) of the input speech. Visual features are extracted by optical-flow analysis and combined with audio features in an HMM-based recognition framework. Phone HMMs are built using the multi-stream HMM technique. Experiments conducted on Japanese connected-digit speech contaminated with white noise at various SNRs show the effectiveness of the proposed method: recognition accuracy is improved by using the visual information at all SNRs, and the best improvement is approximately 6% at 5 dB SNR.
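The multi-stream HMM technique mentioned above combines per-stream observation likelihoods with exponential stream weights, i.e. the state output probability is the audio likelihood raised to a weight times the visual likelihood raised to the complementary weight. A minimal sketch in log domain (the weight value 0.7 is a hypothetical illustration, not a parameter reported by the paper):

```python
def multistream_log_likelihood(log_b_audio, log_b_visual, lambda_audio=0.7):
    """Multi-stream HMM state output score in the log domain.

    Computes lambda_a * log b_audio(o_a) + lambda_v * log b_visual(o_v),
    which corresponds to b_audio^lambda_a * b_visual^lambda_v in the
    probability domain. The weights are constrained to sum to one;
    lambda_audio = 0.7 here is an assumed example value.
    """
    lambda_visual = 1.0 - lambda_audio
    return lambda_audio * log_b_audio + lambda_visual * log_b_visual


# Example: equal weights average the two stream log-likelihoods.
score = multistream_log_likelihood(-2.0, -4.0, lambda_audio=0.5)
```

In practice the stream weights are tuned per noise condition, since the relative reliability of the audio stream drops as SNR decreases.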


Full Paper

Bibliographic reference.  Yoshinaga, Tomoaki / Tamura, Satoshi / Iwano, Koji / Furui, Sadaoki (2003): "Audio-visual speech recognition using lip movement extracted from side-face images", In AVSP 2003, 117-120.