Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Labeling Audio-Visual Speech Corpora and Training an ANN/HMM Audio-Visual Speech Recognition System

Martin Heckmann (1), Frédéric Berthommier (2), Christophe Savariaux (2), Kristian Kroschel (1)

(1) Institut für Nachrichtentechnik, Universität Karlsruhe, Germany
(2) Institut de la Communication Parlée (ICP), Institut National Polytechnique de Grenoble, France

We present a method to label an audio-visual database and to set up a system for audio-visual speech recognition based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) approach.

The multi-stage labeling process is demonstrated on a new audio-visual database recorded at the Institut de la Communication Parlée (ICP). The database was generated by transposing the audio database NUMBERS95 into an audio-visual setting. For the labeling, a large subset of NUMBERS95 is first used for a bootstrap training of an ANN, which is then employed to label the audio part of the audio-visual database. This initial labeling is further improved by readapting the ANN to the new database and repeating the labeling. The video labeling is then derived from the audio labeling.
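The core of such a bootstrap labeling step is a forced alignment: the ANN's frame-level phone posteriors are aligned against the known transcript to produce per-frame labels. The following is a minimal, self-contained sketch of that alignment by dynamic programming; the function name, the toy posteriors, and the simplification that each transcript position maps to one posterior column are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def forced_align(log_post, transcript):
    """Viterbi forced alignment: assign each frame to one phone of the
    transcript, preserving order (each phone covers >= 1 frame).

    log_post: (T, S) array of log posteriors, where column s corresponds
    to the s-th phone of the transcript (hypothetical simplification)."""
    T, S = log_post.shape
    assert T >= S, "need at least one frame per transcript phone"
    NEG = -np.inf
    score = np.full((T, S), NEG)          # best log score ending at (t, s)
    back = np.zeros((T, S), dtype=int)    # backpointer: previous phone index
    score[0, 0] = log_post[0, 0]
    for t in range(1, T):
        for s in range(min(t + 1, S)):
            stay = score[t - 1, s]                      # remain in phone s
            move = score[t - 1, s - 1] if s > 0 else NEG  # advance from s-1
            if stay >= move:
                score[t, s] = stay + log_post[t, s]
                back[t, s] = s
            else:
                score[t, s] = move + log_post[t, s]
                back[t, s] = s - 1
    # backtrack from the final phone at the last frame
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    path.reverse()
    return [transcript[s] for s in path]
```

The readaptation stage of the paper would then amount to retraining the ANN on these frame labels and running the alignment again on the new database.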

Tests at different Signal to Noise Ratios (SNR) are performed to demonstrate the efficiency of the labeling process. Furthermore, ways to incorporate information from a large audio database into the final audio-visual recognition system are investigated.
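Testing at a given SNR requires scaling a noise signal so that the speech-to-noise power ratio matches the target before mixing. A minimal sketch of that step (the function name and the use of mean-square power over the whole signal are assumptions; the paper does not specify its mixing procedure):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the resulting speech-to-noise power ratio is
    `snr_db` decibels, then add it to `speech`."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # gain such that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```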


Bibliographic reference.  Heckmann, Martin / Berthommier, Frédéric / Savariaux, Christophe / Kroschel, Kristian (2000): "Labeling audio-visual speech corpora and training an ANN/HMM audio-visual speech recognition system", in ICSLP-2000, vol. 4, 9-12.