Sixth International Conference on Spoken Language Processing
We present a method to label an audio-visual database and to set up a system for audio-visual speech recognition based on a hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) approach.
The multi-stage labeling process is demonstrated on a new audio-visual database recorded at the Institut de la Communication Parlée (ICP). The database was generated via transposition of the audio database NUMBERS95. For the labeling, a large subset of NUMBERS95 is first used for bootstrap training of an ANN, which is then employed to label the audio part of the audio-visual database. This initial labeling is further improved by re-adapting the ANN to the new database and repeating the labeling. The video labeling is then derived from the audio labeling.
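The last step, deriving video labels from audio labels, can be illustrated with a minimal sketch. The abstract does not say how the derivation is done, so everything here is an assumption: the function name `audio_to_video_labels`, the frame rates, and the majority-vote rule are hypothetical and serve only to show one plausible way of transferring frame-level labels between two streams with different frame rates.

```python
from collections import Counter


def audio_to_video_labels(audio_labels, audio_rate_hz=100, video_rate_hz=50):
    """Map per-frame audio labels onto video frames (hypothetical helper).

    Assumption: both streams start at the same instant and the rates are
    positive integers. Each video frame receives the most frequent label
    among the audio frames that fall inside it; on a tie, the label that
    occurs first in that span wins.
    """
    if audio_rate_hz <= 0 or video_rate_hz <= 0:
        raise ValueError("frame rates must be positive")

    # Number of complete video frames covered by the audio labels.
    n_video = len(audio_labels) * video_rate_hz // audio_rate_hz

    video_labels = []
    for k in range(n_video):
        # Integer arithmetic avoids floating-point boundary errors.
        start = k * audio_rate_hz // video_rate_hz
        end = max(start + 1, (k + 1) * audio_rate_hz // video_rate_hz)
        span = audio_labels[start:end]
        video_labels.append(Counter(span).most_common(1)[0][0])
    return video_labels


# Illustrative labels only; these are not taken from the database.
example = audio_to_video_labels(["s", "s", "s", "ih", "ih", "ih", "ks", "ks"])
```

With the assumed rates, two audio frames fall into each video frame, so the eight audio labels above yield four video labels. A real system would also have to handle any time offset between the two streams, which this sketch ignores.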
Tests at different signal-to-noise ratios (SNRs) are performed to demonstrate the efficiency of the labeling process. Furthermore, ways to incorporate information from a large audio database into the final audio-visual recognition system are investigated.
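For readers unfamiliar with testing at a fixed SNR, the following sketch shows one common way of scaling a noise signal so that the mixture reaches a target SNR in dB. The abstract does not describe how the noisy test material was produced, so the function names `mix_at_snr` and `measured_snr_db` and the additive-noise setup are assumptions made purely for illustration.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech at a target SNR in dB (hypothetical helper).

    Assumption: both signals have the same length and non-zero power.
    The noise is scaled by a gain g chosen so that
    10 * log10(P_speech / (g**2 * P_noise)) equals snr_db.
    """
    if len(speech) != len(noise) or not speech:
        raise ValueError("signals must be non-empty and of equal length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


def measured_snr_db(speech, noise, gain):
    """SNR in dB of speech against the scaled noise."""
    return 10 * math.log10(mean_power(speech) / (gain ** 2 * mean_power(noise)))


# Toy signals, not real speech or noise recordings.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed, gain = mix_at_snr(speech, noise, snr_db=0.0)
```

Sweeping `snr_db` over a range of values gives the set of test conditions; the recognizer is then evaluated separately on each mixture.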
Bibliographic reference. Heckmann, Martin / Berthommier, Frédéric / Savario, Christophe / Kroschel, Kristian (2000): "Labeling audio-visual speech corpora and training an ANN/HMM audio-visual speech recognition system", In ICSLP-2000, vol.4, 9-12.