The improvement in the detectability of speech afforded by visible cues, found by Grant and Seitz (JASA, 108:1197-1208, 2000), has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that the audio and visual signals could interact early in the audio-visual perceptual process on the basis of audio envelope cues. Acoustic-visual correlations were also reported earlier by Yehia et al. (Speech Communication, 26(1):23-43, 1998). Taking these two findings into account, the problem of extracting the redundant audio-visual components is revisited: a video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz's prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and the video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so that it does not bias phonetic identification. The low-level stage described here is therefore compatible with a late-integration process, and it is a potential front-end for speech recognition applications.
Cite as: Berthommier, F. (2003) A phonetically neutral model of the low-level audiovisual interaction. Proc. Auditory-Visual Speech Processing, 89-94
@inproceedings{berthommier03_avsp,
  author={Frédéric Berthommier},
  title={{A phonetically neutral model of the low-level audiovisual interaction}},
  year=2003,
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={89--94}
}
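The core interaction described in the abstract, a product in the audio pathway between time-aligned audio envelopes and video-predicted envelopes, can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: the equal-width FFT band split, the frame length, and the `video_pred_env` weights (assumed already time-aligned and scaled to [0, 1]) are all assumptions made for the example.

```python
import numpy as np

def subband_envelope_energies(signal, n_bands=4, frame_len=160):
    """Frame-wise energy in n_bands equal-width frequency bands: a crude
    stand-in for the paper's 4-subband envelope energy features."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # per-frame power spectrum
    bands = np.array_split(power, n_bands, axis=1)        # equal-width subbands
    return np.stack([b.sum(axis=1) for b in bands], axis=1)  # (n_frames, n_bands)

def multiplicative_interaction(audio_env, video_pred_env):
    """Product interaction in the audio pathway: the noisy audio envelopes are
    weighted element-wise by the video-predicted envelopes (assumed aligned)."""
    return audio_env * video_pred_env
```

With video-predicted weights in [0, 1], the product can only attenuate the noisy audio envelopes, which is one way to read the enhancement role the abstract assigns to the redundant components.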