End-to-End Convolutional Sequence Learning for ASL Fingerspelling Recognition

Katerina Papadimitriou, Gerasimos Potamianos


Although fingerspelling is an often overlooked component of sign languages, it has great practical value in the communication of important context words that lack dedicated signs. In this paper we consider the problem of fingerspelling recognition in videos, introducing an end-to-end lexicon-free model that consists of a deep auto-encoder image feature learner followed by an attention-based encoder-decoder for prediction. The feature extractor is a vanilla auto-encoder variant employing a quadratic activation function. The learned features are subsequently fed into the attention-based encoder-decoder. The latter deviates from traditional recurrent neural network architectures: it is a fully convolutional attention-based encoder-decoder, equipped with a multi-step attention mechanism that relies on a quadratic alignment function and with gated linear units applied over the convolution output. The introduced model is evaluated on the TTIC/UChicago fingerspelling video dataset, where it outperforms previous approaches in letter accuracy under all three experimental paradigms: signer-dependent, signer-adapted, and signer-independent.
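The gated linear units applied over the convolution output can be illustrated as follows. This is a minimal NumPy sketch of the standard GLU operation (split the channels in half, gate one half by the sigmoid of the other), not the authors' implementation; the function name `glu` and the toy dimensions are ours.

```python
import numpy as np

def glu(x, axis=0):
    """Gated linear unit: split the channel dimension in half and
    gate the first half by the sigmoid of the second half."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

# Toy 1-D convolution output: 4 channels over 5 time steps.
conv_out = np.random.randn(4, 5)
gated = glu(conv_out, axis=0)
print(gated.shape)  # (2, 5) -- the channel count halves after gating
```

Because the gate saturates between 0 and 1, GLUs let gradients flow through the ungated half of the channels, which is one reason they are favored in deep convolutional sequence models.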


DOI: 10.21437/Interspeech.2019-2422

Cite as: Papadimitriou, K., Potamianos, G. (2019) End-to-End Convolutional Sequence Learning for ASL Fingerspelling Recognition. Proc. Interspeech 2019, 2315-2319, DOI: 10.21437/Interspeech.2019-2422.


@inproceedings{Papadimitriou2019,
  author={Katerina Papadimitriou and Gerasimos Potamianos},
  title={{End-to-End Convolutional Sequence Learning for ASL Fingerspelling Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2315--2319},
  doi={10.21437/Interspeech.2019-2422},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2422}
}