16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Cross Database Training of Audio-Visual Hidden Markov Models for Phone Recognition

Shahram Kalantari, David Dean, Houman Ghaemmaghami, Sridha Sridharan, Clinton Fookes

Queensland University of Technology, Australia

Speech recognition can be improved by augmenting the audio signal with visual information in the form of the speaker's lip movements. To date, state-of-the-art audio-visual speech recognition techniques train their models on the audio and visual data of a single database. In this paper, we present a new approach that exploits one modality of an external dataset in addition to a given audio-visual dataset. This makes it possible to build more powerful models from extensive audio-only databases and adapt them to comparatively smaller multi-stream databases. For phone recognition, the proposed approach outperforms the widely adopted synchronous hidden Markov models (HMMs) trained jointly on the audio and visual data of a given audio-visual database by 29% relative. It also outperforms external audio models trained on extensive external audio datasets by 5.5% relative, and internal audio models by 46% relative. We further show that the proposed approach is beneficial in noisy environments, where the audio source is corrupted by environmental noise.
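The synchronous HMMs that serve as the baseline above tie both modalities to a single shared state sequence. A common formulation of the per-state emission score in such multi-stream models (a minimal sketch under standard assumptions, not taken from this paper; the function names, 1-D Gaussian emissions, and the stream weight `w_audio` are illustrative) weights the per-stream log-likelihoods with stream exponents:

```python
import math

def log_gauss(x, mean, var):
    # Log-density of a 1-D Gaussian observation (illustrative emission model).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def synchronous_emission(audio_obs, visual_obs, audio_state, visual_state,
                         w_audio=0.7):
    # Synchronous multi-stream emission score: both streams share one HMM
    # state, and the combined log-likelihood is a weighted sum of the
    # per-stream log-likelihoods (stream exponents w_audio and 1 - w_audio).
    la = log_gauss(audio_obs, *audio_state)   # audio stream score
    lv = log_gauss(visual_obs, *visual_state) # visual stream score
    return w_audio * la + (1.0 - w_audio) * lv

# With w_audio = 1.0 the model degenerates to audio-only scoring; lowering
# w_audio shifts reliance toward the visual stream, which is one reason
# audio-visual models help when the audio is corrupted by noise.
audio_only = synchronous_emission(0.5, -0.5, (0.0, 1.0), (0.0, 1.0), 1.0)
balanced = synchronous_emission(0.5, -0.5, (0.0, 1.0), (0.0, 1.0), 0.5)
```

In this framing, cross-database training amounts to estimating the audio-stream parameters on a large external audio-only corpus and the visual-stream parameters on the smaller audio-visual corpus, rather than estimating both jointly on the small corpus.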


Bibliographic reference. Kalantari, Shahram / Dean, David / Ghaemmaghami, Houman / Sridharan, Sridha / Fookes, Clinton (2015): "Cross database training of audio-visual hidden Markov models for phone recognition", in INTERSPEECH-2015, 553-557.