ISCA Archive Interspeech 2005

Improving lip-reading with feature space transforms for multi-stream audio-visual speech recognition

Jing Huang, Karthik Visweswariah

In this paper we investigate feature space transforms to improve lip-reading performance for multi-stream HMM-based audio-visual speech recognition (AVSR). The feature space transforms include a non-linear Gaussianization transform and feature space maximum likelihood linear regression (fMLLR). We apply Gaussianization at various stages of the visual front-end. The results show that Gaussianizing the final visual features achieves the best performance: an 8% gain on lip-reading and a 14% gain on AVSR. We also compare the performance of speaker-based Gaussianization and global Gaussianization. Without fMLLR adaptation, speaker-based Gaussianization yields larger improvements in lip-reading and multi-stream AVSR performance. With fMLLR adaptation, however, global Gaussianization shows better results, achieving an 18% gain over baseline fMLLR adaptation for AVSR.
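
The abstract does not say how the Gaussianization transform is estimated. As a minimal sketch, assuming a rank-based empirical CDF applied to each feature dimension independently (one common way to Gaussianize features, not necessarily the authors' method), the difference between global and speaker-based Gaussianization could look like the following. The function names `gaussianize` and `gaussianize_per_speaker` are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution, used for its inverse CDF.
_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def gaussianize_dimension(values):
    """Map one feature dimension to an approximately N(0, 1) distribution.

    Assumption: the empirical CDF is estimated from ranks. Ties receive
    distinct ranks in their original order, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the CDF estimate strictly inside (0, 1).
        out[idx] = _STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return out


def gaussianize(frames):
    """Global variant: one transform per dimension over all frames."""
    if not frames:
        return []
    columns = [gaussianize_dimension(list(col)) for col in zip(*frames)]
    return [list(row) for row in zip(*columns)]


def gaussianize_per_speaker(frames, speaker_ids):
    """Speaker-based variant: estimate the transform separately per speaker."""
    groups = {}
    for i, spk in enumerate(speaker_ids):
        groups.setdefault(spk, []).append(i)
    out = [None] * len(frames)
    for indices in groups.values():
        transformed = gaussianize([frames[i] for i in indices])
        for i, vec in zip(indices, transformed):
            out[i] = vec
    return out
```

In the global variant a frame's output depends on its rank among all speakers' frames, whereas in the speaker-based variant it depends only on its rank among that speaker's frames, so speaker-specific offsets in the features are removed before modeling.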


doi: 10.21437/Interspeech.2005-373

Cite as: Huang, J., Visweswariah, K. (2005) Improving lip-reading with feature space transforms for multi-stream audio-visual speech recognition. Proc. Interspeech 2005, 1221-1224, doi: 10.21437/Interspeech.2005-373

@inproceedings{huang05e_interspeech,
  author={Jing Huang and Karthik Visweswariah},
  title={{Improving lip-reading with feature space transforms for multi-stream audio-visual speech recognition}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={1221--1224},
  doi={10.21437/Interspeech.2005-373}
}