International Conference on Auditory-Visual Speech Processing 2008
Tangalooma Wild Dolphin Resort,
Moreton Island, Queensland, Australia
In speech recognition, the problem of speaker variability has been well studied. Common approaches to dealing with it include normalising for a speaker's vocal tract length and learning a linear transform that moves the speaker-independent models closer to a new speaker. In pure lip-reading (no audio), the problem has been less well studied. Results are often presented that are based on speaker-dependent (single-speaker) or multispeaker (speakers in the test-set are also in the training-set) data, situations that are of limited use in real applications. This paper shows the danger of not using different speakers in the training- and test-sets. Firstly, we present classification results on a new single-word database, AVletters 2, which is a high-definition version of the well-known AVletters database. By careful choice of features, we show that it is possible for the performance of visual-only lip-reading to be very close to that of audio-only recognition for the single-speaker and multispeaker configurations. However, in the speaker-independent configuration, the performance of the visual-only channel degrades dramatically. By applying multidimensional scaling (MDS) to both the audio features and visual features, we demonstrate that lip-reading visual features, when compared with the MFCCs commonly used for audio speech recognition, have inherently small variation within a single speaker across all classes spoken. However, visual features are highly sensitive to the identity of the speaker, whereas audio features are relatively invariant.
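The distinction between the multispeaker and speaker-independent configurations can be illustrated with a minimal sketch. This is not the authors' evaluation protocol: the toy `samples` list, the function names, and the `test_every` hold-out rule are all hypothetical, chosen only to show how the two kinds of split differ.

```python
from collections import defaultdict

def multispeaker_split(samples, test_every=3):
    """Multispeaker: every speaker contributes to both training and test sets."""
    train, test = [], []
    seen = defaultdict(int)
    for speaker, label in samples:
        seen[speaker] += 1
        # Hold out every `test_every`-th utterance of each speaker (assumed rule).
        target = test if seen[speaker] % test_every == 0 else train
        target.append((speaker, label))
    return train, test

def speaker_independent_splits(samples):
    """Speaker-independent: the test speaker never appears in the training set."""
    speakers = sorted({speaker for speaker, _ in samples})
    for held_out in speakers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Hypothetical toy data: 3 speakers, 3 letters, 2 repetitions each.
samples = [(spk, letter)
           for spk in ("s1", "s2", "s3")
           for letter in "ABC"
           for _ in range(2)]

train, test = multispeaker_split(samples)
print("multispeaker test speakers:", sorted({s for s, _ in test}))

for held_out, train_si, test_si in speaker_independent_splits(samples):
    print("held-out speaker:", held_out, "| training utterances:", len(train_si))
```

In the first split a classifier can exploit speaker-specific appearance at test time; in the second it cannot, which is the configuration in which the abstract reports that visual-only performance degrades.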
Bibliographic reference. Cox, Stephen / Harvey, Richard / Lan, Yuxuan / Newman, Jacob / Theobald, Barry-John (2008): "The challenge of multispeaker lip-reading", In AVSP-2008, 179-184.