Third International Conference on Spoken Language Processing (ICSLP 94)
We present recent work on integrating visual information (automatic lip-reading) with acoustic speech for better overall speech recognition. A Multi-State Time Delay Neural Network recognizes spelled letter sequences, taking advantage of lip images from a standard camera. The problems addressed include efficient yet effective representation of the visual information and the optimal manner of combining the two modalities when rendering a decision. We show results for several alternatives to the direct gray-level image as the visual evidence: Principal Components, Linear Discriminants, and DFT coefficients. The dimensionality of the input is decreased by a factor of 12 while maintaining recognition rates. Combination of the visual and acoustic information is performed at three different levels of abstraction. Results suggest that integration of higher-order input features works best. On a continuous spelling task, visual-alone recognition rates of 45-55% are achieved; combining the visual evidence with the acoustic data lowers audio-alone error rates by 30-40%.
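The factor-of-12 reduction of the visual input can be illustrated with a Principal Components projection of flattened gray-level lip images. The sketch below is a minimal NumPy illustration, not the paper's exact pipeline; the image size (384 pixels) and the number of retained components (32) are assumptions chosen only to reproduce the stated reduction factor.

```python
import numpy as np

def pca_reduce(images, k):
    """Project flattened gray-level lip images onto the top-k
    principal components (illustrative sketch, not the paper's
    exact feature extractor)."""
    X = images - images.mean(axis=0)           # center each pixel
    # SVD of the centered data matrix; rows of Vt are the
    # principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                        # (n_frames, k) features

# Hypothetical example: 384-pixel lip images reduced to 32
# coefficients, i.e. the factor-of-12 reduction from the abstract.
rng = np.random.default_rng(0)
frames = rng.random((100, 384))                # 100 synthetic frames
feats = pca_reduce(frames, 32)
print(feats.shape)  # (100, 32)
```

A Linear Discriminant or DFT front end would slot into the same place, replacing the projection matrix with class-discriminative directions or low-order spectral coefficients, respectively.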
Bibliographic reference. Duchnowski, Paul / Meier, Uwe / Waibel, Alex (1994): "See me, hear me: integrating automatic speech recognition and lip-reading", In ICSLP-1994, 547-550.