Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

An Audio-Visual Speech Recognition Framework Based on Articulatory Features

Tian Gan (1), Wolfgang Menzel (1), Shiqiang Yang (2)

(1) Department of Informatics, University of Hamburg, Germany
(2) Department of Computer Science and Technology, Tsinghua University, China

This paper presents an audio-visual speech recognition framework based on articulatory features, which aims to combine the advantages of both modalities and achieves better recognition accuracy than a phone-based recognizer. In our approach, we use HMMs to model abstract articulatory classes, which are extracted in parallel from both the speech signal and the video frames. The N-best outputs of these independent classifiers are combined to decide on the best articulatory feature tuples. By mapping these tuples to phones, a phone stream can be generated. A lexical search finally maps this phone stream to meaningful word transcriptions. We demonstrate the potential of our approach in a preliminary experiment on the GRID database, which contains continuous English voice commands for a small-vocabulary task.
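The decision-fusion step described above (combining the N-best outputs of the independent audio and video classifiers into an articulatory feature tuple, then mapping that tuple to a phone) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weighting scheme, the feature streams (place, manner, voicing), and the tuple-to-phone table are all hypothetical.

```python
# Hypothetical sketch of decision fusion over articulatory features (AFs):
# two independent classifiers (audio, video) each emit N-best
# (label, score) hypotheses per AF stream; scores are combined with a
# fixed modality weight, the best tuple is chosen, then mapped to a phone.

def fuse_nbest(audio_nbest, video_nbest, audio_weight=0.7):
    """Combine N-best (label, score) lists from both modalities;
    return the label with the highest weighted score."""
    combined = {}
    for label, score in audio_nbest:
        combined[label] = combined.get(label, 0.0) + audio_weight * score
    for label, score in video_nbest:
        combined[label] = combined.get(label, 0.0) + (1 - audio_weight) * score
    return max(combined, key=combined.get)

# Illustrative AF-tuple -> phone table (place, manner, voicing).
AF_TO_PHONE = {
    ("bilabial", "stop", "voiced"): "b",
    ("bilabial", "stop", "voiceless"): "p",
    ("alveolar", "fricative", "voiceless"): "s",
}

def decide_phone(audio_streams, video_streams):
    """Fuse each AF stream across modalities, then look up the phone."""
    af_tuple = tuple(fuse_nbest(a, v)
                     for a, v in zip(audio_streams, video_streams))
    return AF_TO_PHONE.get(af_tuple)

# Example: both modalities favour a voiced bilabial stop.
audio = [[("bilabial", 0.8), ("alveolar", 0.2)],
         [("stop", 0.9), ("fricative", 0.1)],
         [("voiced", 0.6), ("voiceless", 0.4)]]
video = [[("bilabial", 0.9)],
         [("stop", 0.7)],
         [("voiced", 0.5)]]
print(decide_phone(audio, video))  # -> b
```

Running this per frame (or per segment) yields the phone stream that the lexical search then maps to word transcriptions.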


Bibliographic reference. Gan, Tian / Menzel, Wolfgang / Yang, Shiqiang (2007): "An audio-visual speech recognition framework based on articulatory features", in AVSP-2007, paper P01.