7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

DCT-Based Video Features for Audio-Visual Speech Recognition

Martin Heckmann (1), Kristian Kroschel (1), Christophe Savariaux (2), Frédéric Berthommier (2)

(1) Universität Karlsruhe, Germany; (2) Institut de la Communication Parlée/INPG, France

Encouraged by the good performance of the DCT in audio-visual speech recognition [1], we investigate how the selection of the DCT coefficients influences the recognition scores of a hybrid ANN/HMM audio-visual speech recognition system on a continuous word recognition task with a vocabulary of 30 numbers. Three sets of coefficients were chosen, based on the mean energy, the variance, and the variance relative to the mean value. The performance of these coefficient sets is evaluated in a video-only and an audio-visual recognition scenario with varying signal-to-noise ratios (SNR). The audio-visual tests are performed with 5 types of additional noise at 12 SNR values each. Furthermore, the results of the DCT-based recognition are compared to those obtained via chroma-keyed geometric lip features [2]. To enable this comparison, a second audio-visual database without chroma-key has been recorded. This database has similar content but a different speaker.
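The three selection criteria named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dct2` and `rank_coefficients`, the square grayscale mouth-region frames, the naive orthonormal DCT-II, and the reading of "variance relative to the mean value" as variance divided by the absolute mean are all assumptions made for illustration.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of lists.

    Naive O(N^4) implementation, meant only for small illustrative blocks.
    """
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    # cos_table[u][x] = cos(pi * (2x + 1) * u / (2n))
    cos_table = [
        [math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
        for u in range(n)
    ]
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += block[x][y] * cos_table[u][x] * cos_table[v][y]
            out[u][v] = scale[u] * scale[v] * total
    return out


def rank_coefficients(frames, criterion, k, eps=1e-12):
    """Return the k coefficient indices (u, v) with the highest score.

    criterion is one of:
      "energy"            : mean squared coefficient value over all frames
      "variance"          : variance of the coefficient over all frames
      "relative_variance" : variance / (|mean| + eps), an ASSUMED reading
                            of "variance relative to the mean value"
    """
    coeffs = [dct2(frame) for frame in frames]
    n = len(coeffs[0])
    t = len(coeffs)
    scores = {}
    for u in range(n):
        for v in range(n):
            values = [c[u][v] for c in coeffs]
            mean = sum(values) / t
            energy = sum(x * x for x in values) / t
            variance = sum((x - mean) ** 2 for x in values) / t
            if criterion == "energy":
                scores[(u, v)] = energy
            elif criterion == "variance":
                scores[(u, v)] = variance
            elif criterion == "relative_variance":
                scores[(u, v)] = variance / (abs(mean) + eps)
            else:
                raise ValueError("unknown criterion: " + criterion)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: three uniform 4x4 frames that differ only in brightness,
# so only the DC coefficient (0, 0) carries energy and varies over time.
frames = [[[level] * 4 for _ in range(4)] for level in (1.0, 2.0, 3.0)]
top_energy = rank_coefficients(frames, "energy", 1)
top_variance = rank_coefficients(frames, "variance", 1)
```

In such a scheme, the selected indices would be fixed on training data and the corresponding coefficients extracted per frame as the video feature vector; how the paper forms the final feature vector is not stated in this abstract.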


Bibliographic reference.  Heckmann, Martin / Kroschel, Kristian / Savariaux, Christophe / Berthommier, Frédéric (2002): "DCT-based video features for audio-visual speech recognition", In ICSLP-2002, 1925-1928.