INTERSPEECH 2004 - ICSLP
This paper presents an audio-visual speaker-dependent continuous speech recognition system. The idea is to extract features from the audio and video streams of a speaking person separately and to use the combined feature vectors to train a Hidden-Markov-Model-based recognizer. While the audio feature extraction follows a classical approach, the visual features are obtained by means of an advanced image processing algorithm which tracks certain regions on the speaker's lips with high robustness and accuracy. For a self-generated audio-visual database, we compare the recognition rates of audio-only, video-only, and audio-visual recognition systems, and we compare the audio and audio-visual systems under different noise conditions. The work is part of a larger project which aims at a new man-machine interface in the form of a so-called Virtual Personal Assistant, which communicates with the user based on the multimodal integration of natural communication channels.
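The feature-level fusion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, frame rates, and the nearest-frame upsampling of the slower visual stream are assumptions chosen for the example.

```python
import numpy as np

def combine_features(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features into one vector stream.

    audio_feats:  (T_a, D_a) array, e.g. MFCC frames (rates/dims illustrative)
    visual_feats: (T_v, D_v) array, e.g. lip-region motion features
    The slower visual stream is upsampled by frame repetition so both
    streams share the audio frame rate before concatenation.
    """
    T_a = audio_feats.shape[0]
    T_v = visual_feats.shape[0]
    # Map each audio frame index to a (floor) visual frame index.
    idx = np.minimum((np.arange(T_a) * T_v) // T_a, T_v - 1)
    upsampled = visual_feats[idx]
    # Combined vectors: one row per audio frame, audio dims then visual dims.
    return np.hstack([audio_feats, upsampled])

# Example: 100 audio frames of 13 coefficients, 25 video frames of 6 features
audio = np.random.randn(100, 13)
video = np.random.randn(25, 6)
combined = combine_features(audio, video)
print(combined.shape)  # (100, 19)
```

Each row of the combined array would then serve as one observation vector for HMM training.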
Bibliographic reference. Martinez, Maria José Sanchez / Gutierrez, Juan Pablo de la Cruz (2004): "Speech recognition using motion based lipreading", In INTERSPEECH-2004, 2513-2516.