Combining Residual Networks with LSTMs for Lipreading

Themos Stafylakis, Georgios Tzimiropoulos


We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory (LSTM) networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words appearing in 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.
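As a rough illustration of the three-stage pipeline described above (spatiotemporal convolutional front-end, per-frame residual network, bidirectional LSTM back-end with a softmax over the 500 target words), the sketch below traces tensor shapes through each stage. All hyperparameters here (frame count, kernel size, stride, padding, feature dimensions) are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: trace tensor shapes through a lipreading pipeline of the kind
# described in the abstract. All layer hyperparameters are assumed for
# illustration and are not taken from the paper.

def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula for one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

def lipreading_shapes(frames=29, height=112, width=112, vocab=500):
    # Stage 1: 3D convolution over (T, H, W) with an assumed 5x7x7 kernel,
    # stride (1, 2, 2) and padding (2, 3, 3); time resolution is preserved.
    t = conv_out(frames, 5, 1, 2)
    h = conv_out(height, 7, 2, 3)
    w = conv_out(width, 7, 2, 3)
    # Stage 2: a 2D residual network applied per frame, collapsing each
    # h x w feature map to a single vector (assumed dimension 256).
    feat = 256
    # Stage 3: a bidirectional LSTM over the T per-frame vectors
    # (assumed 256 units per direction, so 512 after concatenation).
    hidden = 2 * 256
    # Final linear layer + softmax over the 500-word vocabulary.
    return {
        "after_3dconv": (t, h, w),
        "resnet_features": (t, feat),
        "bilstm_output": (t, hidden),
        "logits": vocab,
    }

print(lipreading_shapes())
```

The point of the trace is that the 3D front-end keeps the temporal axis intact, so the recurrent back-end sees one feature vector per video frame.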


DOI: 10.21437/Interspeech.2017-85

Cite as: Stafylakis, T., Tzimiropoulos, G. (2017) Combining Residual Networks with LSTMs for Lipreading. Proc. Interspeech 2017, 3652-3656, DOI: 10.21437/Interspeech.2017-85.


@inproceedings{Stafylakis2017,
  author={Themos Stafylakis and Georgios Tzimiropoulos},
  title={Combining Residual Networks with LSTMs for Lipreading},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3652--3656},
  doi={10.21437/Interspeech.2017-85},
  url={http://dx.doi.org/10.21437/Interspeech.2017-85}
}