LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading

Leyuan Qu, Cornelius Weber, Stefan Wermter


Lip reading, also known as visual speech recognition, has recently received considerable attention. Although advanced feature engineering and powerful deep neural network architectures have been proposed for this task, performance is still not competitive with that of speech recognition using the audio modality as input. This is mainly because, compared with audio, visual features carry less information relevant to word recognition. For example, the voiced sound produced while the vocal cords vibrate is captured by audio but is not reflected in mouth or lip movement. In this paper, we map the sequence of mouth-movement images directly to a mel-spectrogram in order to reconstruct the speech-relevant information. Our proposed architecture consists of two components: (a) a mel-spectrogram reconstruction front-end, an encoder-decoder architecture with an attention mechanism that predicts mel-spectrograms from video; (b) a lip reading back-end consisting of convolutional layers, bi-directional gated recurrent units, and a connectionist temporal classification (CTC) loss, which consumes the generated mel-spectrogram representation to predict text transcriptions. Speaker-dependent evaluation results demonstrate that our proposed model not only generates high-quality mel-spectrograms but also outperforms state-of-the-art models on the GRID benchmark lip reading dataset, achieving a 0.843% character error rate and a 2.525% word error rate.
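To make the back-end concrete, the following is a minimal PyTorch sketch of the general pattern the abstract describes: convolutional layers over the reconstructed mel-spectrogram, a bi-directional GRU, and a CTC loss over per-frame character log-probabilities. All layer sizes, kernel widths, the vocabulary size, and the frame count are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn as nn

class LipReadBackend(nn.Module):
    """Hypothetical sketch: conv layers -> bi-directional GRU -> per-frame
    character log-probs, trained with CTC. All sizes are assumptions."""
    def __init__(self, n_mels=80, hidden=256, n_chars=28):  # n_chars includes CTC blank
        super().__init__()
        # 1-D convolutions along the time axis of the mel-spectrogram
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_chars)

    def forward(self, mel):                  # mel: (batch, n_mels, time)
        x = self.conv(mel)                   # (batch, hidden, time)
        x, _ = self.gru(x.transpose(1, 2))   # (batch, time, 2*hidden)
        return self.fc(x).log_softmax(-1)    # log-probs for CTC

model = LipReadBackend()
mel = torch.randn(2, 80, 75)                 # 2 utterances, 75 mel frames each
log_probs = model(mel)                       # (2, 75, 28)

# CTC loss over the log-probabilities (dummy target indices for illustration)
ctc = nn.CTCLoss(blank=0)
targets = torch.randint(1, 28, (2, 10))
loss = ctc(log_probs.transpose(0, 1),        # CTC expects (time, batch, chars)
           targets,
           torch.full((2,), 75, dtype=torch.long),
           torch.full((2,), 10, dtype=torch.long))
```

Because CTC marginalizes over all alignments between the 75 input frames and the shorter character sequence, no frame-level labels are needed; the back-end can be trained directly from (mel-spectrogram, transcription) pairs.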


 DOI: 10.21437/Interspeech.2019-1393

Cite as: Qu, L., Weber, C., Wermter, S. (2019) LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading. Proc. Interspeech 2019, 2768-2772, DOI: 10.21437/Interspeech.2019-1393.


@inproceedings{Qu2019,
  author={Leyuan Qu and Cornelius Weber and Stefan Wermter},
  title={{LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2768--2772},
  doi={10.21437/Interspeech.2019-1393},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1393}
}