INTERSPEECH 2007
8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Speech Recognition Techniques for a Sign Language Recognition System

Philippe Dreuw, David Rybach, Thomas Deselaers, Morteza Zahedi, Hermann Ney

RWTH Aachen University, Germany

One of the most significant differences between automatic sign language recognition (ASLR) and automatic speech recognition (ASR) lies in the computer vision problems of ASLR, whereas the corresponding problems in speech signal processing have largely been solved by intensive research over the last 30 years. We present an approach that starts from a large vocabulary speech recognition system in order to profit from the insights obtained in ASR research.

The developed system is able to recognize sentences of continuous sign language independently of the speaker. The features used are obtained from standard video cameras without any special data acquisition devices. In particular, we focus on feature and model combination techniques applied in ASR, and on the usage of pronunciation and language models (LM) in sign language. These techniques can be used for all kinds of sign language recognition systems, and for many video analysis problems where the temporal context is important, e.g. action or gesture recognition.
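The abstract does not spell out the decision rule, but the ASR techniques referred to here rest on the standard Bayes decision rule with a scaled language model and log-linearly combined models. The following is a minimal sketch of that criterion applied to video features; the function names, the callable interface, and the default scaling value are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Sequence

Words = Sequence[str]

def best_hypothesis(
    hypotheses: Sequence[Words],
    visual_log_scores: Sequence[Callable[[Words], float]],  # log p_k(x|w) of each visual model (e.g. hand features, intensity image)
    model_weights: Sequence[float],                          # log-linear model combination weights, tuned on development data
    lm_log_score: Callable[[Words], float],                  # log p(w) from an n-gram language model
    lm_scale: float = 20.0,                                  # language model scaling factor (assumed value)
) -> Words:
    """ASR-style decision rule with log-linear model combination:
    pick the word sequence maximizing lm_scale * log p(w) + sum_k weight_k * log p_k(x | w)."""
    def total(words: Words) -> float:
        visual = sum(w * score(words) for w, score in zip(model_weights, visual_log_scores))
        return lm_scale * lm_log_score(words) + visual
    return max(hypotheses, key=total)
```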

On a publicly available benchmark database consisting of 201 sentences and 3 signers, we achieve a word error rate (WER) of 17%.
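The WER reported here is the standard word-level edit-distance metric carried over from ASR. Below is a minimal reference implementation for readers unfamiliar with it, not the evaluation code used for the paper; the gloss sequences in the usage example are hypothetical.

```python
def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed as a word-level Levenshtein distance."""
    n, m = len(reference), len(hypothesis)
    # d[i][j]: edit distance between the first i reference words and the first j hypothesis words
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution / deletion / insertion
    return d[n][m] / max(n, 1)

# Hypothetical glosses: one substitution and two deletions against a six-word reference -> WER = 3/6
print(word_error_rate("JOHN GIVE WOMAN IX BOOK MAN".split(),
                      "JOHN WOMAN WOMAN BOOK".split()))
```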


Audio-Visual Examples

correct-example.avi  This video shows a completely correctly recognized American Sign Language sentence. The top left picture shows the input video frames overlaid with the optimal dominant-hand path determined by the tracking module. The top middle image shows the last 5 dominant-hand tracking positions and the resulting trajectory. The top right image shows the corresponding horizontal and vertical velocities over time. These features, i.e. the original downscaled intensity image and the dominant-hand position, velocity, and trajectory, were extracted to train and test the system (see the feature sketch after these examples). The recognized output of the system is shown in the first line; the bottom line shows the reference annotation of the sentence to be recognized.
error-example-with-pointing.avi  This video shows a recognized American Sign Language sentence which contains one substitution error, shown in red, and two deletion errors, shown as blanks. A special kind of gesture that occurs in sign language is the spatial reference into the virtual signing space, denoted here by 'IX'. The referenced location is not yet decoded; it is simply handled as a pointing event 'IX'. The layout and the extracted features are the same as in the example above. The recognized output of the system is shown in the first line and contains the substitution of the word 'GIVE' by the word 'WOMAN' and the deletions of the words 'MAN' and 'IX'; the bottom line shows the reference annotation of the sentence to be recognized.
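The captions above list the per-frame features: the downscaled intensity image plus the tracked dominant-hand position, its velocity, and a short trajectory window. The sketch below shows one way such a frame-level feature vector could be assembled from tracked positions, assuming the tracker already supplies (x, y) coordinates per frame; the helper function, the window length, and the padding strategy are illustrative assumptions, not the authors' exact feature pipeline.

```python
import numpy as np

def hand_features(positions: np.ndarray, t: int, window: int = 5) -> np.ndarray:
    """Frame-level features for time t from tracked dominant-hand positions.

    positions: array of shape (T, 2) with (x, y) hand coordinates per frame,
               e.g. as produced by a tracking module.
    Returns the position, the velocity (finite difference in x and y), and the
    trajectory over the last `window` frames, concatenated into one vector.
    """
    pos = positions[t]
    vel = positions[t] - positions[t - 1] if t > 0 else np.zeros(2)
    start = max(0, t - window + 1)
    traj = positions[start:t + 1]
    # pad the trajectory window at the sequence start so the feature dimension stays fixed
    pad = np.repeat(traj[:1], window - len(traj), axis=0)
    traj = np.concatenate([pad, traj], axis=0)
    return np.concatenate([pos, vel, traj.ravel()])

# A full frame feature would additionally append the downscaled intensity image,
# e.g. np.concatenate([hand_features(P, t), frame_image.ravel()]).
```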

Bibliographic reference.  Dreuw, Philippe / Rybach, David / Deselaers, Thomas / Zahedi, Morteza / Ney, Hermann (2007): "Speech recognition techniques for a sign language recognition system", In INTERSPEECH-2007, 2513-2516.