Auditory-Visual Speech Processing 2007 (AVSP2007), Kasteel Groenendaal, Hilvarenbeek, The Netherlands
This paper presents an audio-visual speech recognition framework based on articulatory features, which combines the advantages of audio-visual recognition and articulatory feature modelling and achieves better recognition accuracy than a phone-based recognizer. In our approach, we use HMMs to model abstract articulatory classes, which are extracted in parallel from both the speech signal and the video frames. The N-best outputs of these independent classifiers are combined to decide on the best articulatory feature tuples. By mapping these tuples to phones, a phone stream is generated. A lexical search finally maps this phone stream to meaningful word transcriptions. We demonstrate the potential of our approach in a preliminary experiment on the GRID database, which contains continuous English voice commands for a small-vocabulary task.
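The fusion step described above can be illustrated with a minimal sketch, not the authors' implementation: assume each of the two independent classifiers emits an N-best list of values with log-scores for every articulatory feature; a weighted score combination selects the best feature tuple, and a lookup table maps that tuple to a phone. All feature names, scores, the weighting scheme, and the tuple-to-phone table below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical N-best outputs: (value, log-score) per articulatory feature stream.
audio_nbest = {"place": [("bilabial", -0.2), ("alveolar", -1.1)],
               "manner": [("stop", -0.3), ("fricative", -0.9)]}
video_nbest = {"place": [("bilabial", -0.1), ("labiodental", -0.8)],
               "manner": [("stop", -0.5), ("fricative", -0.7)]}

# Illustrative mapping from articulatory feature tuples to phones.
TUPLE_TO_PHONE = {("bilabial", "stop"): "p",
                  ("labiodental", "fricative"): "f",
                  ("alveolar", "stop"): "t"}

def fuse(audio, video, weight=0.5):
    """Score each candidate feature tuple by a weighted sum of the
    per-stream log-scores; return the best tuple that maps to a phone."""
    def stream_score(nbest, feature, value):
        # Values absent from a stream's N-best list get a low floor score.
        return dict(nbest[feature]).get(value, -10.0)

    features = sorted(audio)  # e.g. ["manner", "place"]
    # Candidate tuples: every combination of values seen in either stream.
    candidates = product(*({**dict(audio[f]), **dict(video[f])} for f in features))
    best = None
    for values in candidates:
        tup = dict(zip(features, values))
        key = (tup["place"], tup["manner"])
        if key not in TUPLE_TO_PHONE:
            continue  # discard tuples with no phonetic interpretation
        score = sum(weight * stream_score(audio, f, v)
                    + (1 - weight) * stream_score(video, f, v)
                    for f, v in tup.items())
        if best is None or score > best[0]:
            best = (score, TUPLE_TO_PHONE[key])
    return best

print(fuse(audio_nbest, video_nbest))  # -> (-0.55, 'p') for these toy scores
```

In the paper's pipeline, phones decided this way form the stream that the final lexical search maps to word transcriptions.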
Bibliographic reference. Gan, Tian / Menzel, Wolfgang / Yang, Shiqiang (2007): "An audio-visual speech recognition framework based on articulatory features", In AVSP-2007, paper P01.