Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Discriminatively Trained Features Using fMPE for Multi-Stream Audio-Visual Speech Recognition

Jing Huang, Daniel Povey

IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

fMPE is a recently introduced discriminative training technique that uses the Minimum Phone Error (MPE) criterion to train a feature-level transformation. In this paper we investigate fMPE-trained audio and visual features for multi-stream HMM-based audio-visual speech recognition. A flexible, layer-based implementation of fMPE allows us to combine the visual information with the audio stream within the discriminative training process and so dispense with the multi-stream approach. Experiments are reported on the IBM infrared headset audio-visual database. Averaged over 1 hour of speaker-independent test data from 20 speakers, the fMPE-trained acoustic features achieve a 33% relative gain. Adding video layers on top of the audio layers gives an additional 10% gain over fMPE-trained features from the audio stream alone. The fMPE-trained visual features achieve a 14% relative gain, while decision fusion of the audio and visual streams with fMPE-trained features achieves a 29% relative gain. However, fMPE-trained models do not improve over the original models on mismatched noisy test data.
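
As a rough illustration of the decision fusion mentioned above, the sketch below combines per-state log-likelihoods from an audio stream and a visual stream using exponent weights, which is the usual form of state-level fusion in a multi-stream HMM. It is not taken from the paper: the function names, the example scores, and the audio weight of 0.7 are all hypothetical, and the weights are assumed to sum to one.

```python
def fuse_log_likelihoods(audio_ll, visual_ll, audio_weight=0.7):
    """Weighted sum of per-state stream log-likelihoods.

    Assumes both lists score the same HMM states in the same order and
    that the stream weights sum to one (a common convention, not a
    detail confirmed by the abstract).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_ll) != len(visual_ll):
        raise ValueError("streams must score the same number of states")
    visual_weight = 1.0 - audio_weight
    return [audio_weight * a + visual_weight * v
            for a, v in zip(audio_ll, visual_ll)]


def best_state(scores):
    """Index of the highest-scoring state."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-likelihoods for three states at one frame.
audio_scores = [-2.0, -1.0, -3.0]
visual_scores = [-1.0, -4.0, -2.0]

fused = fuse_log_likelihoods(audio_scores, visual_scores, audio_weight=0.7)
chosen = best_state(fused)
```

With an audio weight of 1.0 the fused scores reduce to the audio stream alone, and with 0.0 to the visual stream alone, so the weight controls how much each modality is trusted.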


Bibliographic reference. Huang, Jing / Povey, Daniel (2005): "Discriminatively trained features using fMPE for multi-stream audio-visual speech recognition", in INTERSPEECH-2005, 777-780.