Auditory-Visual Speech Processing (AVSP) 2010
Hakone, Kanagawa, Japan
Automatic speech recognition systems that utilise the visual modality of speech are often investigated within a speaker-dependent or a multi-speaker paradigm. That is, during training the recogniser will have had prior exposure to example speech from each of the possible test speakers. In a previous paper we highlighted the danger of not using different speakers in the training and test sets, and demonstrated that, within a speaker-independent configuration, lip-reading performance degrades dramatically due to the speaker variability encoded in the visual features. In this paper, we examine feature improvement techniques to reduce speaker variability. We demonstrate that, by careful choice of technique, the effects of inter-speaker variability in the visual features can be reduced, which significantly improves the recognition accuracy of an automated lip-reading system. However, the performance of the lip-reading system is still significantly below that of acoustic speech recognition systems, and an analysis of the confusion matrices generated by the recogniser suggests this is largely due to the number of deletions apparent in a visual-only system.
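The abstract does not specify which feature improvement techniques were used. One common way to reduce inter-speaker variability in visual features is per-speaker mean and variance normalisation (z-scoring each feature dimension within each speaker's data); the sketch below illustrates this generic approach and is an assumption, not the paper's actual method. The function name `per_speaker_normalise` and the array layout are hypothetical.

```python
import numpy as np

def per_speaker_normalise(features, speaker_ids):
    """Z-score normalise feature vectors independently per speaker.

    features: (n_frames, n_dims) array of visual speech features
    speaker_ids: length-n_frames array labelling each frame's speaker

    Removing each speaker's own mean and scale is one simple way to
    suppress speaker identity while preserving within-speaker dynamics.
    """
    features = np.asarray(features, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    out = np.empty_like(features)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = features[mask].mean(axis=0)
        sigma = features[mask].std(axis=0)
        sigma[sigma == 0] = 1.0  # guard against constant dimensions
        out[mask] = (features[mask] - mu) / sigma
    return out

# Hypothetical usage: two speakers whose raw features occupy very
# different regions of feature space before normalisation.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(5.0, 2.0, (100, 3)),
                   rng.normal(-3.0, 0.5, (100, 3))])
ids = np.array([0] * 100 + [1] * 100)
norm = per_speaker_normalise(feats, ids)
```

After normalisation, each speaker's features have zero mean and unit variance in every dimension, so a recogniser trained on one set of speakers sees test speakers on a comparable scale.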
Index Terms: lip-reading, feature extraction, feature comparison, speaker variability.
Bibliographic reference. Lan, Yuxuan / Theobald, Barry-John / Harvey, Richard / Ong, Eng-Jon / Bowden, Richard (2010): "Improving visual features for lip-reading", In AVSP-2010, paper S7-3.