Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Improving Lip-Reading with Feature Space Transforms for Multi-Stream Audio-Visual Speech Recognition

Jing Huang, Karthik Visweswariah

IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

In this paper we investigate feature space transforms to improve lip-reading performance for multi-stream HMM-based audio-visual speech recognition (AVSR). The feature space transforms include a non-linear Gaussianization transform and feature space maximum likelihood linear regression (fMLLR). We apply Gaussianization at various stages of the visual front-end. The results show that Gaussianizing the final visual features achieves the best performance: an 8% gain on lip-reading and a 14% gain on AVSR. We also compare the performance of speaker-based Gaussianization and global Gaussianization. Without fMLLR adaptation, speaker-based Gaussianization yields larger improvements on lip-reading and multi-stream AVSR. However, with fMLLR adaptation, global Gaussianization gives better results, achieving an 18% gain over the baseline fMLLR adaptation for AVSR.
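The Gaussianization transform mentioned above is commonly realized as a rank-based mapping: each feature dimension is pushed through its empirical CDF and then through the inverse standard-normal CDF, so the transformed values are approximately N(0, 1). The sketch below illustrates this general idea for a single feature dimension; it is a minimal illustration, not the authors' exact implementation, and the function name `gaussianize` is my own.

```python
from statistics import NormalDist

def gaussianize(values):
    """Rank-based Gaussianization of one feature dimension:
    map each value through the empirical CDF, then through the
    inverse standard-normal CDF, so the output is ~N(0, 1).
    (Illustrative sketch only, not the paper's implementation.)"""
    n = len(values)
    # Sort indices by value to obtain each sample's rank.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    nd = NormalDist()  # standard normal
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the empirical CDF strictly in (0, 1),
        # avoiding infinite values from inv_cdf at 0 or 1.
        out[i] = nd.inv_cdf((rank + 0.5) / n)
    return out
```

In a speaker-based variant, the empirical CDF would be estimated separately from each speaker's data; in the global variant, from all training data pooled together.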


Bibliographic reference: Huang, Jing / Visweswariah, Karthik (2005): "Improving lip-reading with feature space transforms for multi-stream audio-visual speech recognition", in INTERSPEECH-2005, 1221-1224.