FAAVSP - The 1st Joint Conference on
Facial Analysis, Animation, and
This paper presents preliminary experiments using the Kaldi toolkit to investigate audiovisual speech recognition (AVSR) in noisy environments using deep neural networks (DNNs). In particular, we use a single-speaker, large-vocabulary, continuous audiovisual speech corpus to compare the performance of visual-only, audio-only and audiovisual speech recognition. The performance of models trained using the Kaldi toolkit is compared with that of models trained using conventional hidden Markov models (HMMs). In addition, we compare the performance of a speech recognizer both with and without visual features over nine different SNR levels of babble noise, ranging from 20 dB down to -20 dB. The results show that the DNN outperforms conventional HMMs in all experimental conditions, especially for the lip-reading-only system, which achieves a gain of 37.19% in accuracy (84.67% absolute word accuracy). Moreover, the DNN provides an effective SNR improvement of 10 dB and 12 dB for the single-modal and bimodal speech recognition systems, respectively. However, integrating the visual features using simple feature fusion is only effective at SNRs of 5 dB and above. Below this, the degradation in accuracy of the audiovisual system is similar to that of the audio-only recognizer.
Index Terms: lip-reading, speech reading, audiovisual speech recognition
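The noise conditions described in the abstract can be illustrated with a minimal sketch of mixing noise into speech at a target SNR. This is not the authors' pipeline: the function name mix_at_snr, the 5 dB spacing of the nine levels, the synthetic sine "speech" and the Gaussian stand-in for babble noise are all assumptions made for illustration.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Assumed grid: nine evenly spaced levels from 20 dB down to -20 dB (5 dB steps).
SNR_LEVELS_DB = list(range(20, -25, -5))

random.seed(0)
# Hypothetical signals: a 220 Hz tone at 16 kHz and white noise, not real speech or babble.
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

# One noisy copy of the utterance per SNR condition.
noisy = {snr: mix_at_snr(speech, noise, snr) for snr in SNR_LEVELS_DB}
```

Scaling the noise rather than the speech keeps the speech level constant across conditions, so only the noise floor changes from one SNR level to the next.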
Bibliographic reference. Thangthai, Kwanchiva / Harvey, Richard / Cox, Stephen / Theobald, Barry-John (2015): "Improving lip-reading performance for robust audiovisual speech recognition using DNNs", In FAAVSP-2015, 127-131.