In this paper we propose a robust audio-visual speech and speaker recognition system with liveness checks, based on audio-visual fusion of audio, lip-motion and depth features. The liveness verification added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For the visual features, a new tensor-based representation of lip motion, extracted from an intensity and depth subspace of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, yields a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results, with a recognition accuracy of 70.90% for the digit recognition task and equal error rates (EERs) of 5% and 3% for the speaker verification and liveness check tasks, respectively.
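As an illustration of the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy score lists are all assumptions made for this example. It assumes that higher scores indicate a genuine (or live) trial.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    Hypothetical helper for illustration only. A trial is accepted when its
    score is >= threshold. Returns (eer, threshold) at the point where the
    false acceptance and false rejection rates are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    best_eer = None
    best_threshold = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return best_eer, best_threshold


# Toy scores (made up for this sketch, not taken from the paper).
genuine = [0.9, 0.8, 0.7, 0.6]
impostor = [0.1, 0.2, 0.3, 0.65]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch reports the average of the two rates at the closest threshold; interpolating between adjacent thresholds is a common alternative.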
Bibliographic reference. Chetty, Girija / Wagner, Michael (2008): "Audio-visual multilevel fusion for speech and speaker recognition", In INTERSPEECH-2008, 379-382.