Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

Audio-Visual Speech Fragment Decoding

Jon Barker, Xu Shao

Department of Computer Science, University of Sheffield, Sheffield, UK

This paper presents a robust speech recognition technique called audio-visual speech fragment decoding (AV-SFD), in which the visual signal is exploited both as a cue for source separation and as a carrier of phonetic information. The model builds on the existing audio-only SFD technique which, following the auditory scene analysis account of perceptual organisation, combines a bottom-up layer that identifies sound fragments with a model-driven layer that searches for fragment groupings that can be interpreted as recognisable speech utterances. In AV-SFD, the visual signal is used in the model-driven stage, improving the ability of the decoder to distinguish between foreground and background fragments. The system has been evaluated using an audio-visual version of the Pascal Speech Separation Challenge. At low SNRs, recognition error rates are reduced by around 20% relative to the performance of a conventional multistream AV-ASR system.
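The core idea of the model-driven search can be illustrated with a toy sketch: each spectro-temporal fragment is labelled as foreground (target speech) or background, and the decoder searches over labellings for the one whose foreground set scores best under a speech model. Everything below is an illustrative assumption, not the paper's implementation: the exhaustive search, the per-fragment energies, and the toy scoring function (in AV-SFD the score would additionally incorporate the visual stream).

```python
from itertools import product

def speech_model_score(foreground, fragments):
    # Toy stand-in for a speech model likelihood: prefers foreground
    # fragments whose total energy is near a target level. A real SFD
    # system would score acoustic (and, for AV-SFD, visual) features
    # against trained HMMs; this function is a hypothetical placeholder.
    target = 10.0
    energy = sum(fragments[i] for i in foreground)
    return -abs(energy - target)

def sfd_decode(fragments, score=speech_model_score):
    """Exhaustively search all foreground/background labellings.

    Feasible only for a handful of fragments; real decoders search the
    labelling space jointly with the recognition hypothesis.
    """
    best_labelling, best_score = None, float("-inf")
    for labels in product([0, 1], repeat=len(fragments)):
        foreground = [i for i, lab in enumerate(labels) if lab == 1]
        s = score(foreground, fragments)
        if s > best_score:
            best_labelling, best_score = labels, s
    return best_labelling, best_score

# Toy data: per-fragment energies for four fragments.
fragments = [4.0, 6.0, 3.0, 9.0]
labels, s = sfd_decode(fragments)
print(labels, s)  # the first two fragments together match the model best
```

The sketch shows why top-down knowledge helps: the same bottom-up fragments admit many groupings, and only the model score disambiguates which belong to the foreground speaker.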

Full Paper

Bibliographic reference.  Barker, Jon / Shao, Xu (2007): "Audio-visual speech fragment decoding", In AVSP-2007, paper L5-2.