Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach

Li Liu, Thomas Hueber, Gang Feng, Denis Beautemps


This study addresses the problem of automatic recognition of Cued Speech (CS), a visual mode of communication for hearing-impaired people in which a complete phonetic repertoire is obtained by combining lip movements with hand cues. In the proposed system, the dynamics of visual features extracted from lip and hand images using convolutional neural networks (CNN) are modeled by a set of hidden Markov models (HMM), one for each phonetic context (tandem architecture). CNN-based feature extraction is compared to an unsupervised approach based on principal component analysis. A novel temporal segmentation of the hand streams is used to train the CNNs efficiently. Different strategies for combining the extracted visual features within the HMM decoder are investigated. Experimental evaluation is carried out on an audiovisual dataset (containing only continuous French sentences) recorded specifically for this study. In its best configuration, and without exploiting any dictionary or language model, the proposed tandem CNN-HMM architecture correctly identifies more than 73% of the phonemes (62% when insertion errors are taken into account).
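As a rough illustration of the unsupervised baseline the abstract mentions (PCA-based visual feature extraction, against which the CNN features are compared), the sketch below projects a stack of flattened image frames onto their top principal components. All shapes, dimensions, and names here are illustrative assumptions, not the paper's actual configuration; the resulting per-frame feature vectors would be what a downstream HMM decoder consumes.

```python
import numpy as np

def pca_features(frames, n_components=32):
    """Sketch of unsupervised PCA feature extraction from image frames.

    frames: array of shape (n_frames, height, width); dimensions are
    hypothetical, not taken from the paper.
    Returns an (n_frames, n_components) matrix of per-frame features.
    """
    # Flatten each frame into a vector and center the data
    X = frames.reshape(len(frames), -1).astype(np.float64)
    X -= X.mean(axis=0)
    # SVD of the centered data matrix; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_components, Vt.shape[0])
    # Project each frame onto the top-k principal components
    return X @ Vt[:k].T

# Hypothetical usage: 100 frames of 64x64 grayscale lip images
frames = np.random.rand(100, 64, 64)
feats = pca_features(frames, n_components=32)
print(feats.shape)  # (100, 32)
```

In a tandem architecture such as the one described above, these compact feature vectors (whether from PCA or a CNN) serve as the observation sequence for the HMMs rather than being classified directly.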


DOI: 10.21437/Interspeech.2018-2434

Cite as: Liu, L., Hueber, T., Feng, G., Beautemps, D. (2018) Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach. Proc. Interspeech 2018, 2643-2647, DOI: 10.21437/Interspeech.2018-2434.


@inproceedings{Liu2018,
  author={Li Liu and Thomas Hueber and Gang Feng and Denis Beautemps},
  title={Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={2643--2647},
  doi={10.21437/Interspeech.2018-2434},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2434}
}