In this paper, speaker adaptation is investigated for audio-visual automatic speech recognition (ASR) using the multistream hidden Markov model (HMM). First, audio-only and visual-only HMM parameters are adapted by combining maximum a posteriori and maximum likelihood linear regression adaptation. Subsequently, the audio-visual HMM stream exponents are adapted by means of discriminative training to better capture the reliability of each modality for the specific speaker. Various visual feature sets are compared, and features based on linear discriminant analysis are demonstrated to result in superior multispeaker and speaker-adapted recognition performance. In addition, visual feature mean normalization is shown to significantly improve visual-only and audio-visual ASR performance. Adaptation experiments on a 49-subject database are reported. On average, a 28% relative word error reduction is achieved by adapting the multispeaker audio-visual HMM to each subject in the database.
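To make the combination step concrete, the sketch below (not taken from the paper) illustrates the usual multistream idea: the audio and visual stream likelihoods of an HMM state are raised to stream exponents, which in the log domain become weights on the two log-likelihoods, and visual feature mean normalization subtracts the per-utterance mean from each feature dimension. Function names and the example exponent values are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch, assuming two streams (audio, visual) with exponents
# lambda_A and lambda_V; exponents on likelihoods become linear weights
# on log-likelihoods. Values and names are illustrative only.
import numpy as np

def multistream_log_likelihood(loglik_audio, loglik_visual,
                               lambda_audio=0.7, lambda_visual=0.3):
    """Combine per-state stream log-likelihoods with stream exponents."""
    return lambda_audio * loglik_audio + lambda_visual * loglik_visual

def mean_normalize(features):
    """Visual feature mean normalization: subtract the per-utterance mean
    from every feature dimension (frames x dims array)."""
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)

if __name__ == "__main__":
    # Toy example: 5 frames of 3-dimensional visual features.
    vis = np.random.randn(5, 3) + 2.0
    print(mean_normalize(vis).mean(axis=0))          # approximately zero per dimension
    print(multistream_log_likelihood(-42.0, -55.0))  # weighted combined score
```

In the paper, the stream exponents themselves are adapted per speaker via discriminative training rather than fixed as in this toy example.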
Cite as: Potamianos, G., Potamianos, A. (1999) Speaker adaptation for audio-visual speech recognition. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1291-1294, doi: 10.21437/Eurospeech.1999-303
@inproceedings{potamianos99_eurospeech,
  author={Gerasimos Potamianos and Alexandros Potamianos},
  title={{Speaker adaptation for audio-visual speech recognition}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={1291--1294},
  doi={10.21437/Eurospeech.1999-303}
}