Improving Speaker-Independent Lipreading with Domain-Adversarial Training

Michael Wand, Jürgen Schmidhuber


We present a lipreading system, i.e. a speech recognition system based solely on visual features, which achieves speaker independence via domain-adversarial training. Domain-adversarial training is integrated into the optimization of a lipreader based on a stack of feedforward and LSTM (Long Short-Term Memory) recurrent neural networks, yielding an end-to-end trainable system which requires only a very small amount of untranscribed target-speaker data to substantially improve the recognition accuracy on that speaker. On pairs of different source and target speakers, we achieve a relative accuracy improvement of around 40% with only 15 to 20 seconds of untranscribed target speech data. On multi-speaker training setups, the accuracy improvements are smaller but still substantial.
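The key mechanism in domain-adversarial training is a gradient reversal layer between the shared feature extractor and the domain (here: speaker) classifier: it acts as the identity in the forward pass, but negates and scales the gradient in the backward pass, so the shared features are trained to predict words while becoming uninformative about speaker identity. The sketch below is a minimal NumPy illustration of that idea, not the paper's implementation; the names (`grl_backward`, `lam`) and the toy gradient values are our own assumptions.

```python
import numpy as np

def grl_forward(x):
    # Gradient reversal layer, forward pass: identity.
    return x

def grl_backward(grad_from_domain_head, lam):
    # Backward pass: negate and scale the gradient flowing from the
    # domain (speaker) classifier into the shared feature extractor.
    return -lam * grad_from_domain_head

# Toy gradients arriving at the shared features (hypothetical values):
g_label = np.array([0.2, -0.1])   # from the word-recognition head
g_domain = np.array([0.4, 0.3])   # from the speaker-classification head
lam = 1.0

# The feature extractor is updated with the label gradient plus the
# *reversed* domain gradient: useful for words, adversarial to speaker ID.
g_features = g_label + grl_backward(g_domain, lam)
print(g_features)  # [-0.2 -0.4]
```

In a full system, `lam` is typically annealed from 0 upward during training so that the adversarial signal only kicks in once the features are somewhat stable.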


DOI: 10.21437/Interspeech.2017-421

Cite as: Wand, M., Schmidhuber, J. (2017) Improving Speaker-Independent Lipreading with Domain-Adversarial Training. Proc. Interspeech 2017, 3662-3666, DOI: 10.21437/Interspeech.2017-421.


@inproceedings{Wand2017,
  author={Michael Wand and Jürgen Schmidhuber},
  title={Improving Speaker-Independent Lipreading with Domain-Adversarial Training},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3662--3666},
  doi={10.21437/Interspeech.2017-421},
  url={http://dx.doi.org/10.21437/Interspeech.2017-421}
}