Speaker Adaptation for Lip-Reading Using Visual Identity Vectors

Pujitha Appan Kandala, Abhinav Thanda, Dilip Kumar Margam, Rohith Chandrashekar Aralikatti, Tanay Sharma, Sharad Roy, Shankar M. Venkatesan


Visual speech recognition, or lip-reading, suffers from a high word error rate (WER) because it relies solely on the articulators visible to the camera. Recent works have mitigated this problem with complex deep neural network architectures. I-vector based speaker adaptation is a well-known technique for reducing WER on unseen speakers in ASR systems. In this work, we explore speaker adaptation of lip-reading models using latent identity vectors (visual i-vectors) obtained by factor analysis on visual features. To estimate the visual i-vectors, we collect sufficient statistics in two ways: first with a GMM-based universal background model (UBM), and second with an RNN-HMM based UBM. The speaker-specific visual i-vector is given as an additional input to the hidden layers of the lip-reading model during both the training and test phases. On the GRID corpus, visual i-vectors yield 15% and 10% relative improvements over current state-of-the-art lip-reading architectures on unseen speakers with the RNN-HMM and GMM based methods, respectively. Furthermore, we examine how WER varies with the dimension of the visual i-vectors and with the amount of unseen-speaker data available for visual i-vector estimation. We also report results on a Korean visual corpus that we created.
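For context, the standard i-vector model represents a speaker-dependent mean supervector M as M = m + Tw, where m is the UBM mean supervector, T is a low-rank total variability matrix learned by factor analysis, and w is the latent identity vector (the i-vector); estimating w requires the zeroth- and first-order sufficient statistics accumulated against the UBM, which the GMM and RNN-HMM based methods above supply. The sketch below illustrates only the adaptation step itself: concatenating an utterance-level visual i-vector to the hidden activations of a lip-reading network. It is a minimal sketch in PyTorch; the layer sizes, the i-vector dimension, and the class name are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch: feeding a speaker-specific visual i-vector to the
# hidden layers of a lip-reading model. All names and dimensions here
# are assumptions for illustration, not the paper's actual settings.
import torch
import torch.nn as nn

class IVectorAdaptedLipReader(nn.Module):
    def __init__(self, feat_dim=512, ivec_dim=100, hidden_dim=256, num_classes=52):
        super().__init__()
        # Frame-level visual features -> first hidden layer.
        self.fc1 = nn.Linear(feat_dim, hidden_dim)
        # This hidden layer additionally receives the speaker i-vector.
        self.fc2 = nn.Linear(hidden_dim + ivec_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats, ivec):
        # feats: (batch, time, feat_dim) frame-level visual features
        # ivec:  (batch, ivec_dim) one visual i-vector per utterance
        h = torch.relu(self.fc1(feats))
        # Tile the utterance-level i-vector across all time steps and
        # concatenate it to the hidden activations.
        ivec_tiled = ivec.unsqueeze(1).expand(-1, feats.size(1), -1)
        h = torch.relu(self.fc2(torch.cat([h, ivec_tiled], dim=-1)))
        return self.out(h)  # per-frame logits

For example, a batch of four 75-frame utterances (e.g. 3-second GRID clips at 25 fps) could be run as logits = IVectorAdaptedLipReader()(torch.randn(4, 75, 512), torch.randn(4, 100)). The same concatenation can be applied at more than one hidden layer, matching the abstract's description of feeding the i-vector to the hidden layers of the model.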


DOI: 10.21437/Interspeech.2019-3237

Cite as: Kandala, P.A., Thanda, A., Margam, D.K., Aralikatti, R.C., Sharma, T., Roy, S., Venkatesan, S.M. (2019) Speaker Adaptation for Lip-Reading Using Visual Identity Vectors. Proc. Interspeech 2019, 2758-2762, DOI: 10.21437/Interspeech.2019-3237.


@inproceedings{Kandala2019,
  author={Pujitha Appan Kandala and Abhinav Thanda and Dilip Kumar Margam and Rohith Chandrashekar Aralikatti and Tanay Sharma and Sharad Roy and Shankar M. Venkatesan},
  title={{Speaker Adaptation for Lip-Reading Using Visual Identity Vectors}},
  year={2019},
  booktitle={Proc. Interspeech 2019},
  pages={2758--2762},
  doi={10.21437/Interspeech.2019-3237},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3237}
}