Auditory-Visual Speech Processing 2007 (AVSP2007)

Kasteel Groenendaal, Hilvarenbeek, The Netherlands
August 31 - September 3, 2007

Maximising Audio-Visual Speech Correlation

Ibrahim Almajai, Ben Milner

School of Computing Sciences, University of East Anglia, UK

This work investigates a selection of audio and visual speech features with the aim of finding pairs that maximise audio-visual correlation. Two audio speech features are used in the analysis: filterbank vectors and the first four formant frequencies. Three visual features are considered: active appearance model (AAM), 2-D DCT and cross-DCT. Audio and visual speech features are extracted from a database of 200 sentences, and multiple linear regression is used to measure the audio-visual correlation. The results show that filterbank features exhibit a multiple correlation of around R=0.8 with visual features, while formant frequencies show substantially less correlation: R=0.6 for formants 1 and 2 and less than R=0.4 for formants 3 and 4. The three visual features show almost identical correlation with the audio features, varying in multiple correlation by less than 0.1, even though the methods of visual feature extraction are very different. Measuring the audio-visual correlation within each phoneme and then averaging across all phonemes increases the correlation to R=0.9.
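As an illustration of the measure described above, the following is a minimal sketch of how a multiple correlation coefficient R could be computed with ordinary least-squares regression, and how it could be averaged over phoneme classes. It is not the authors' implementation. The function names, the choice of NumPy, and the direction of the regression (a vector of visual features predicting a single audio feature) are assumptions made for illustration only.

```python
import numpy as np


def multiple_correlation(X, y):
    """Multiple correlation R between predictors X (n, p) and target y (n,).

    Hypothetical helper: fits y ~ intercept + X by least squares and
    returns the square root of the coefficient of determination.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the regression includes an intercept.
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    if ss_tot == 0.0:
        # A constant target has no variance to explain.
        return 0.0
    r_squared = 1.0 - ss_res / ss_tot
    # Clip tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(r_squared, 0.0)))


def mean_within_class_correlation(X, y, labels):
    """Average of R computed separately within each class label.

    Hypothetical helper: `labels` holds one class (e.g. a phoneme
    identity) per frame; R is measured per class and then averaged.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels)
    scores = [
        multiple_correlation(X[labels == c], y[labels == c])
        for c in np.unique(labels)
    ]
    return float(np.mean(scores))
```

Here `X` would hold one row of visual features per frame and `y` one audio feature (for example a single filterbank channel or formant) for the same frames. Note that when a class contains few frames relative to the number of predictors, the within-class regression can overfit and report an inflated R, so any such average should be read with the class sizes in mind.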


Bibliographic reference. Almajai, Ibrahim / Milner, Ben (2007): "Maximising audio-visual speech correlation", In AVSP-2007, paper P17.