COST278 and ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction

University of East Anglia, Norwich, UK
August 30-31, 2004

Audio-Visual Speech Recognition Using New Lip Features Extracted from Side-Face Images

Tomoaki Yoshinaga, Satoshi Tamura, Koji Iwano, Sadaoki Furui

Department of Computer Science, Tokyo Institute of Technology, Japan

This paper proposes new visual features for audio-visual speech recognition using lip information extracted from side-face images. To increase the noise robustness of speech recognition, we previously proposed an audio-visual speech recognition method that uses speaker lip information extracted from side-face images captured by a small camera installed in a mobile device. That method used only lip-movement information, measured by optical-flow analysis, as a visual feature. Because lip-shape information is also important, this paper combines lip-shape information with lip-movement information to improve audio-visual speech recognition performance. The angle between the upper and lower lips (lip-angle) and its derivative are extracted as lip-shape features. The effectiveness of the lip-angle features was evaluated under various SNR conditions. The proposed features improved recognition accuracy under all SNR conditions compared with audio-only recognition. The largest improvement, 8.0% absolute, was obtained at 5 dB SNR. Combining the lip-angle features with our previous optical-flow features yielded further improvement. These visual features were confirmed to be effective even when the audio HMM used in our method was adapted to noise by the MLLR method.
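As a rough illustration of the lip-angle feature pair described above, the following minimal sketch computes an angle and its time derivative per video frame. The abstract does not state how the angle is measured, so the three landmark points (mouth corner, upper-lip point, lower-lip point), the function names, and the central-difference derivative are all assumptions made for illustration, not the authors' implementation.

```python
import math

def lip_angle(corner, upper, lower):
    """Angle in degrees at the mouth corner between the rays to the
    upper-lip and lower-lip points (hypothetical landmark layout)."""
    ux, uy = upper[0] - corner[0], upper[1] - corner[1]
    lx, ly = lower[0] - corner[0], lower[1] - corner[1]
    dot = ux * lx + uy * ly
    cross = ux * ly - uy * lx
    # atan2 of |cross| and dot gives the unsigned angle in [0, 180].
    return math.degrees(math.atan2(abs(cross), dot))

def delta(values):
    """Per-frame derivative: central difference in the interior,
    one-sided difference at the first and last frame (an assumption)."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    out = []
    for t in range(n):
        lo, hi = max(t - 1, 0), min(t + 1, n - 1)
        out.append((values[hi] - values[lo]) / (hi - lo))
    return out

def lip_angle_features(frames):
    """frames: list of (corner, upper, lower) point triples, one per frame.
    Returns one (lip_angle, delta_lip_angle) pair per frame."""
    angles = [lip_angle(c, u, l) for c, u, l in frames]
    return list(zip(angles, delta(angles)))
```

Each returned pair could then be appended to the audio and optical-flow feature vectors for the same frame; how the streams are synchronized and weighted is outside the scope of this sketch.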



Bibliographic reference. Yoshinaga, Tomoaki / Tamura, Satoshi / Iwano, Koji / Furui, Sadaoki (2004): "Audio-visual speech recognition using new lip features extracted from side-face images", in Robust2004, paper 33.