FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and
Auditory-Visual Speech Processing

Vienna, Austria
September 11-13, 2015

Discovering Patterns in Visual Speech

Stephen Cox

School of Computing Sciences, University of East Anglia, Norwich, UK

We know that an audio speech signal can be unambiguously decoded by any native speaker of the language it is uttered in, provided that it meets some quality conditions. But we do not know if this is the case with visual speech, because the process of lipreading is rather mysterious and seems to rely heavily on the use of context and non-speech cues. How much information about the speech content is there in a visual speech signal? We make some attempt to provide an answer to this question by ‘discovering’ matching segments of phoneme sequences that represent recurring words and phrases in audio and visual representations of the same speech. We use a modified version of the technique of segmental dynamic programming that was introduced by Park and Glass. Comparison of the results shows that visual speech displays rather less matching content than the audio, and reveals some interesting differences in the phonetic content of the information recovered by the two modalities.

Index Terms: automatic lip reading, visual speech processing, speech recognition
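
The segmental dynamic programming procedure itself is described in the full paper; as a rough illustration only, the Python sketch below runs a simplified, band-constrained DTW over two phoneme strings and reports the lowest-average-cost alignment fragment in each diagonal band as a candidate recurring segment, loosely in the spirit of Park and Glass. The band width, length and cost thresholds, the 0/1 symbol cost, and the toy sequences are illustrative assumptions, not values taken from the paper.

import numpy as np

def segmental_dtw(seq_a, seq_b, band_width=3, min_len=4, max_avg_cost=0.4):
    """Return (average_cost, path) candidates, at most one per diagonal band."""
    la, lb = len(seq_a), len(seq_b)
    # Local cost: 0 for matching phoneme labels, 1 otherwise (a crude stand-in
    # for the frame-level distances used with real audio/visual features).
    dist = np.array([[0.0 if a == b else 1.0 for b in seq_b] for a in seq_a])

    # Bands start at evenly spaced points on the top and left edges of the grid.
    starts = [(0, j) for j in range(0, lb, band_width)]
    starts += [(i, 0) for i in range(band_width, la, band_width)]

    candidates = []
    for si, sj in starts:
        cost = np.full((la, lb), np.inf)
        back = {}
        for i in range(si, la):
            for j in range(sj, lb):
                # Stay within a band around the diagonal through (si, sj).
                if abs((i - si) - (j - sj)) > band_width:
                    continue
                if i == si and j == sj:
                    cost[i, j] = dist[i, j]
                    continue
                prev = []
                if i > si and j > sj:
                    prev.append((cost[i - 1, j - 1], (i - 1, j - 1)))
                if i > si:
                    prev.append((cost[i - 1, j], (i - 1, j)))
                if j > sj:
                    prev.append((cost[i, j - 1], (i, j - 1)))
                best, node = min(prev, key=lambda t: t[0])
                if np.isfinite(best):
                    cost[i, j] = best + dist[i, j]
                    back[(i, j)] = node
        # Keep the end point in this band whose path has the lowest average cost.
        best_avg, best_path = np.inf, None
        for end in back:
            path = [end]
            while path[-1] in back:
                path.append(back[path[-1]])
            if len(path) >= min_len and cost[end] / len(path) < best_avg:
                best_avg, best_path = cost[end] / len(path), path[::-1]
        if best_path is not None and best_avg <= max_avg_cost:
            candidates.append((best_avg, best_path))
    return candidates

# Toy usage: the segment "s p iy ch" recurs across both phoneme strings.
audio = "dh ax s p iy ch w aa z k l ih r s p iy ch".split()
visual = "s p iy ch ih z hh aa d t ax r iy d".split()
for avg, path in segmental_dtw(audio, visual):
    (ai, aj), (bi, bj) = path[0], path[-1]
    print(f"avg cost {avg:.2f}: audio[{ai}:{bi + 1}] ~ visual[{aj}:{bj + 1}]")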

Full Paper

Bibliographic reference.  Cox, Stephen (2015): "Discovering patterns in visual speech", In FAAVSP-2015, 121-126.