Auditory-Visual Speech Processing (AVSP) 2010

Hakone, Kanagawa, Japan
September 30-October 3, 2010

Improving Visual Features for Lip-Reading

Yuxuan Lan (1), Barry-John Theobald (1), Richard Harvey (1), Eng-Jon Ong (2), Richard Bowden (2)

(1) School of Computing Sciences, University of East Anglia, UK
(2) School of Electronics and Physical Sciences, University of Surrey, UK

Automatic speech recognition systems that utilise the visual modality of speech are often investigated within a speaker-dependent or a multi-speaker paradigm. That is, during training the recogniser will have had prior exposure to example speech from each of the possible test speakers. In a previous paper we highlighted the danger of not using different speakers in the training and test sets, and demonstrated that, within a speaker-independent configuration, lip-reading performance degrades dramatically due to the speaker variability encoded in the visual features. In this paper, we examine feature improvement techniques to reduce speaker variability. We demonstrate that, by careful choice of technique, the effects of inter-speaker variability in the visual features can be reduced, which significantly improves the recognition accuracy of an automated lip-reading system. However, the performance of the lip-reading system is still significantly below that of acoustic speech recognition systems, and an analysis of the confusion matrices generated by the recogniser suggests this is largely due to the number of deletions apparent in a visual-only system.
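The kind of speaker-variability reduction the abstract describes can be sketched with a simple per-speaker mean/variance normalisation of the visual feature vectors, analogous to cepstral mean and variance normalisation in acoustic ASR. This is a standard illustrative technique, not necessarily the one evaluated in the paper, and the function name `normalize_per_speaker` is hypothetical.

```python
import numpy as np

def normalize_per_speaker(features, speaker_ids):
    """Z-score normalise feature vectors independently for each speaker.

    A common way to reduce inter-speaker variability in features
    (illustrative only; the paper's own techniques may differ).
    features: (n_frames, n_dims) array of visual features.
    speaker_ids: length-n_frames array of speaker labels.
    """
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = features[mask].mean(axis=0)      # per-speaker mean
        sigma = features[mask].std(axis=0)    # per-speaker std
        sigma[sigma == 0] = 1.0               # guard against constant dims
        out[mask] = (features[mask] - mu) / sigma
    return out
```

After normalisation, each speaker's features have zero mean and unit variance per dimension, so systematic offsets between speakers (e.g. differing lip shapes or appearance) no longer dominate the feature space seen by a speaker-independent recogniser.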

Index Terms: lip-reading, feature extraction, feature comparison, speaker variability.

Full Paper

Bibliographic reference. Lan, Yuxuan / Theobald, Barry-John / Harvey, Richard / Ong, Eng-Jon / Bowden, Richard (2010): "Improving visual features for lip-reading", In AVSP-2010, paper S7-3.