ISCA Archive Interspeech 2021

An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech

Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu

Cued Speech (CS) is a communication system for deaf and hearing-impaired people in which a speaker aids lipreading at the phonetic level by clarifying potentially ambiguous mouth movements with hand shapes and hand positions. Feature extraction from multi-modal CS is a key step in CS recognition. Recent supervised deep-learning methods suffer from noisy CS data annotations, especially for the hand shape modality. In this work, we propose a three-stage model. First, a self-supervised contrastive learning method learns image feature representations without using labels. Second, a small amount of manually annotated CS data is used to fine-tune the first module. Third, a module combining Bi-LSTM and self-attention networks further learns sequential features with temporal and contextual information. In addition, to enlarge the volume and diversity of the currently limited CS datasets, we build a new British English dataset containing 5 native CS speakers. Evaluation results on both the French and British English datasets show that our model achieves over 90% accuracy in hand shape recognition. Compared with the state of the art, CS phoneme recognition correctness improves significantly, by 8.75% (French) and 10.09% (British English).
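
The abstract does not say which contrastive objective the first stage uses. As a rough illustration only, the sketch below computes the NT-Xent loss, a common choice for self-supervised contrastive learning on images, over embeddings of two augmented views of each image. The function names, the temperature value, and the use of plain Python lists are assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(views_a, views_b, temperature=0.5):
    """NT-Xent loss over paired embeddings (illustrative, not the authors' code).

    views_a[i] and views_b[i] are embeddings of two augmentations of image i.
    Each embedding is pulled toward its partner and pushed away from the
    other 2n - 2 embeddings in the batch. No labels are needed.
    """
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same image
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over all candidates except i itself.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - pos_logit
    return total / (2 * n)


# Two toy "images" with 2-D embeddings (made-up numbers).
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]      # second views agree with anchors
mismatched = [[0.1, 0.9], [0.9, 0.1]]   # second views swapped

loss_matched = nt_xent_loss(anchors, matched)
loss_mismatched = nt_xent_loss(anchors, mismatched)
```

In this toy case the loss is lower when the two views of each image have similar embeddings, which is the signal that lets an encoder be pretrained without hand shape annotations before the fine-tuning stage.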


doi: 10.21437/Interspeech.2021-440

Cite as: Wang, J., Gu, N., Yu, M., Li, X., Fang, Q., Liu, L. (2021) An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech. Proc. Interspeech 2021, 626-630, doi: 10.21437/Interspeech.2021-440

@inproceedings{wang21f_interspeech,
  author={Jianrong Wang and Nan Gu and Mei Yu and Xuewei Li and Qiang Fang and Li Liu},
  title={{An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={626--630},
  doi={10.21437/Interspeech.2021-440}
}