AVSP 2003 - International Conference on Audio-Visual Speech Processing

September 4-7, 2003
St. Jorioz, France

Effects of Image Distortions on Audio-Visual Speech Recognition

Martin Heckmann (1), Frédéric Berthommier (2), Christophe Savariaux (2), Kristian Kroschel (1)

(1) Institut für Nachrichtentechnik, Universität Karlsruhe, Germany
(2) Institut de la Communication Parlée (ICP), Grenoble, France

Audio-visual speech recognition leads to significant improvements over audio-only recognition, especially when the audio signal is corrupted by noise. This has been reproduced by many researchers. However, little research has been done on the behavior of audio-visual recognition when the video signal is degraded as well. In this article we investigate the consequences of different types of image degradation, namely white noise, JPEG compression, and errors in the localization of the mouth region, on the audio-visual recognition process. The first question we address is how noise in the video stream influences the recognition scores. To this end, we added noise to both the audio and the video signal at different SNR levels. The second question is how the adaptation of the fusion parameter, which controls the contribution of the audio and video streams to the recognition, is affected by the additional noise in the video stream. We compare the results obtained when the fusion parameter is adapted to the noise in both the audio and video streams with those obtained when it is adapted only to the noise in the audio stream, so that a clean video stream is assumed. For the second type of test we use an automatic adaptation of the fusion parameter based on the entropy of the a-posteriori probabilities from the audio stream.
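As an illustration of the entropy-based adaptation mentioned in the abstract, the following minimal sketch computes the entropy of one frame of audio posteriors and turns it into a fusion weight. The abstract does not give the mapping from entropy to weight or the fusion rule, so the linear mapping in `fusion_weight` and the weighted geometric combination in `fuse` are assumptions made for this example, and all function names are hypothetical.

```python
import math

EPS = 1e-12  # floor that keeps zero probabilities from cancelling a whole class


def entropy(posteriors):
    """Shannon entropy (in nats) of one frame's posterior distribution."""
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)


def fusion_weight(audio_posteriors):
    """Map the audio posterior entropy to an audio weight in [0, 1].

    Assumption: a linear mapping, 1 for a fully confident audio stream
    (zero entropy) and 0 for uniform posteriors (maximum entropy).
    """
    if len(audio_posteriors) < 2:
        raise ValueError("need at least two classes")
    max_entropy = math.log(len(audio_posteriors))
    return 1.0 - entropy(audio_posteriors) / max_entropy


def fuse(audio_posteriors, video_posteriors, lam):
    """Assumed fusion rule: P_a^lam * P_v^(1 - lam), renormalised to sum to 1."""
    scores = [
        (max(pa, EPS) ** lam) * (max(pv, EPS) ** (1.0 - lam))
        for pa, pv in zip(audio_posteriors, video_posteriors)
    ]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    audio = [0.4, 0.3, 0.3]   # noisy audio: flat, high-entropy posteriors
    video = [0.1, 0.8, 0.1]
    lam = fusion_weight(audio)
    print(lam, fuse(audio, video, lam))
```

Because this weight depends only on the audio posteriors, it cannot react to degradations of the video stream, which is the situation the second question of the abstract examines.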


Full Paper

Presentation. Four videos can be viewed. Each sequence is composed of two parts:

  1. Distorted video without sound, with a subtitle indicating the automatic identification obtained from video only
  2. Distorted video with sound

Three cases of distortion are shown:
  1. White noise in the video at 0 dB SNR: Part 1 (5.9 MB), Part 2 (5.0 MB)
  2. JPEG compression with quality factor set to 40 (2.2 MB)
  3. Misplaced mouth region (3.8 MB)
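
For readers who want to reproduce a distortion like the first case, the sketch below adds white Gaussian noise to a flat list of pixel intensities at a target SNR. It is only an illustration: the definition of signal power as the mean squared pixel value, the absence of clipping to the valid intensity range, and the function names are assumptions of this example, not details taken from the paper.

```python
import math
import random


def mean_power(values):
    """Mean squared value of a sequence."""
    return sum(v * v for v in values) / len(values)


def add_white_noise(pixels, snr_db, rng=None):
    """Return pixels plus zero-mean Gaussian noise scaled to the target SNR in dB.

    Assumption: SNR = 10 * log10(signal power / noise power), with signal
    power taken as the mean squared pixel value. No clipping is applied.
    """
    rng = rng or random.Random(0)
    noise_power = mean_power(pixels) / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def measured_snr_db(clean, noisy):
    """SNR in dB actually realised between a clean and a noisy signal."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))


if __name__ == "__main__":
    rng = random.Random(42)
    image = [rng.uniform(0.0, 255.0) for _ in range(10000)]  # stand-in for a mouth image
    noisy = add_white_noise(image, snr_db=0.0, rng=rng)
    print(round(measured_snr_db(image, noisy), 2))
```

At 0 dB the noise carries as much power as the image itself, so the realised SNR printed by the example should be close to zero.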

Bibliographic reference.  Heckmann, Martin / Berthommier, Frédéric / Savariaux, Christophe / Kroschel, Kristian (2003): "Effects of image distortions on audio-visual speech recognition", In AVSP 2003, 163-168.