ISCA Archive Interspeech 2006

Joint interpretation of input speech and pen gestures for multimodal human-computer interaction

Pui-Yu Hui, Helen M. Meng

This paper describes our initial work in the semantic interpretation of multimodal user input consisting of speech and pen gestures. We have designed and collected a multimodal corpus of over a thousand navigational inquiries around the Beijing area. We devised a processing sequence for extracting spoken references from the speech input (perfect transcripts) and interpreting each reference by generating a hypothesis list of possible semantics (i.e., locations). We also devised a processing sequence for interpreting pen gestures (pointing, circling, and strokes) and generating a hypothesis list for every gesture. Partial interpretations from the individual modalities are combined using Viterbi alignment, which enforces constraints of temporal order and semantic compatibility in its cost functions to generate an integrated interpretation of the overall input. This approach correctly interprets over 97% of the 322 multimodal inquiries in our test set.
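To illustrate the kind of alignment the abstract describes, the sketch below performs a monotonic (Viterbi-style) dynamic-programming alignment between a sequence of spoken-reference hypothesis lists and a sequence of pen-gesture hypothesis lists, pairing items in temporal order and charging a cost when their candidate locations share no compatible element. This is a minimal illustration under assumed cost values and a set-intersection compatibility test; the function names, costs, and data layout are hypothetical and not the authors' implementation.

# Hypothetical sketch: monotonic alignment of spoken references and pen gestures.
# Each input is a list of hypothesis lists (sets of candidate location IDs).

def align(speech_refs, gestures, skip_cost=1.0, mismatch_cost=2.0):
    """Return (total_cost, list of aligned (speech_index, gesture_index) pairs)."""
    n, m = len(speech_refs), len(gestures)
    # dp[i][j] = best cost of aligning the first i references with the first j gestures
    dp = [[float("inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == float("inf"):
                continue
            # Pair the next reference with the next gesture (temporal order preserved).
            if i < n and j < m:
                compatible = set(speech_refs[i]) & set(gestures[j])
                cost = 0.0 if compatible else mismatch_cost
                if dp[i][j] + cost < dp[i + 1][j + 1]:
                    dp[i + 1][j + 1] = dp[i][j] + cost
                    back[i + 1][j + 1] = (i, j, "pair")
            # Leave a spoken reference or a gesture unaligned, at a penalty.
            if i < n and dp[i][j] + skip_cost < dp[i + 1][j]:
                dp[i + 1][j] = dp[i][j] + skip_cost
                back[i + 1][j] = (i, j, "skip_ref")
            if j < m and dp[i][j] + skip_cost < dp[i][j + 1]:
                dp[i][j + 1] = dp[i][j] + skip_cost
                back[i][j + 1] = (i, j, "skip_gesture")
    # Trace back the lowest-cost alignment.
    pairs, i, j = [], n, m
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "pair":
            pairs.append((pi, pj))
        i, j = pi, pj
    return dp[n][m], list(reversed(pairs))


if __name__ == "__main__":
    # e.g. "this restaurant" -> {R1, R2}; a circling gesture -> {R2}
    speech = [{"R1", "R2"}, {"P5"}]
    pens = [{"R2"}, {"P5", "P6"}]
    print(align(speech, pens))  # aligns (0, 0) and (1, 1) at zero cost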


doi: 10.21437/Interspeech.2006-362

Cite as: Hui, P.-Y., Meng, H.M. (2006) Joint interpretation of input speech and pen gestures for multimodal human-computer interaction. Proc. Interspeech 2006, paper 1834-Tue2CaP.13, doi: 10.21437/Interspeech.2006-362

@inproceedings{hui06_interspeech,
  author={Pui-Yu Hui and Helen M. Meng},
  title={{Joint interpretation of input speech and pen gestures for multimodal human-computer interaction}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1834-Tue2CaP.13},
  doi={10.21437/Interspeech.2006-362}
}