ISCA Archive Interspeech 2008

Audio-visual multilevel fusion for speech and speaker recognition

Girija Chetty, Michael Wagner

In this paper we propose a robust audio-visual speech-and-speaker recognition system with liveness checks based on audio-visual fusion of audio-lip motion and depth features. The liveness verification feature added here guards the system against advanced spoofing attempts such as manufactured or replayed videos. For the visual modality, a new tensor-based representation of lip motion features, extracted from the intensity and depth subspaces of 3D video sequences, is fused with the audio features. A multilevel fusion paradigm, involving first a Support Vector Machine for speech (digit) recognition and then a Gaussian Mixture Model for speaker verification with liveness checks, allowed a significant performance improvement over single-mode features. Experimental evaluation for different scenarios with AVOZES, a 3D stereovision speaking-face database, shows favourable results: a recognition accuracy of 70.90% for the digit recognition task, and EERs of 5% and 3% for the speaker verification and liveness check tasks respectively.
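The multilevel pipeline described above can be sketched as a two-stage cascade: stage 1 recognizes the spoken digit from fused audio-lip features, and stage 2 verifies the claimed speaker by thresholding a likelihood score, with liveness gated on a similar score. The sketch below is a minimal, dependency-free illustration only: the paper's SVM and GMM stages are replaced by simple stand-ins (a nearest-centroid classifier and a single diagonal Gaussian per speaker), and all names and feature layouts are hypothetical.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log-likelihood of vector x under a diagonal Gaussian (GMM stand-in)."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def recognize_digit(features, centroids):
    """Stage 1: pick the digit whose centroid is nearest to the fused
    audio-lip feature vector (stand-in for the paper's SVM classifier)."""
    def sq_dist(c):
        return sum((f - ci) ** 2 for f, ci in zip(features, c))
    return min(centroids, key=lambda digit: sq_dist(centroids[digit]))

def verify_speaker(features, speaker_model, threshold):
    """Stage 2: accept the claimed identity if the model's likelihood for
    the observed features clears a threshold; a liveness check could gate
    a comparable score computed from the lip-motion features."""
    score = gaussian_log_likelihood(
        features, speaker_model["mean"], speaker_model["var"]
    )
    return score >= threshold
```

A toy run of the cascade, with made-up 2-D features:

```python
centroids = {"zero": [0.0, 0.0], "one": [1.0, 1.0]}
digit = recognize_digit([0.9, 1.1], centroids)          # "one"
model = {"mean": [0.0, 0.0], "var": [1.0, 1.0]}
accepted = verify_speaker([0.0, 0.0], model, -3.0)      # True
```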


doi: 10.21437/Interspeech.2008-153

Cite as: Chetty, G., Wagner, M. (2008) Audio-visual multilevel fusion for speech and speaker recognition. Proc. Interspeech 2008, 379-382, doi: 10.21437/Interspeech.2008-153

@inproceedings{chetty08_interspeech,
  author={Girija Chetty and Michael Wagner},
  title={{Audio-visual multilevel fusion for speech and speaker recognition}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={379--382},
  doi={10.21437/Interspeech.2008-153}
}