Auditory-Visual Speech Processing
We investigate the use of visual, mouth-region information to improve automatic speech recognition (ASR) for the speech impaired. Given video of an utterance by such a subject, we first extract appearance-based visual features from the mouth region of interest and combine them with the subject's audio features into bimodal observations via feature fusion. Subsequently, we adapt the parameters of a speaker-independent, audio-visual hidden Markov model, trained on a large database of hearing subjects, to the audio-visual features extracted from the speech-impaired subject's videos. We consider a number of speaker adaptation techniques and study their performance for a single speech-impaired subject uttering continuous read speech as well as connected digits. For both tasks, maximum a posteriori adaptation followed by maximum likelihood linear regression performs best, achieving relative word error rate reductions of 61% and 96%, respectively, over unadapted audio-visual ASR, and of 13% and 58% over audio-only speaker-adapted ASR. In addition, we compare audio-only and audio-visual speaker-adapted ASR of the single speech-impaired subject to ASR of subjects with normal speech over a wide range of audio-channel signal-to-noise ratios. Interestingly, for the small-vocabulary connected-digits task, audio-visual ASR performance is almost identical across the two populations.
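As an illustrative sketch only (not the authors' implementation, whose exact feature dimensions and synchronization scheme are described in the paper), concatenative feature fusion of the kind referred to above simply appends the visual feature vector to the audio feature vector at each time frame, producing one bimodal observation per frame; the function name `fuse_features` and the toy dimensions are hypothetical:

```python
def fuse_features(audio_frames, visual_frames):
    """Concatenative feature fusion: for each time frame, append the
    visual (mouth-region) feature vector to the audio feature vector,
    yielding one bimodal observation vector per frame.  Assumes the two
    streams are already frame-synchronized (equal frame counts)."""
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must be frame-synchronized")
    return [a + v for a, v in zip(audio_frames, visual_frames)]

# Toy example: 2 frames, 3-dim audio + 2-dim visual -> 5-dim bimodal features
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
visual = [[1.0, 2.0], [3.0, 4.0]]
bimodal = fuse_features(audio, visual)
```

The fused observations can then be modeled by a single audio-visual hidden Markov model, which is what makes the speaker adaptation techniques discussed in the abstract directly applicable to the bimodal feature stream.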
Bibliographic reference. Potamianos, Gerasimos / Neti, Chalapathy (2001): "Automatic speechreading of impaired speech", In AVSP-2001, 177-182.