8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


A Visual Context-Aware Multimodal System for Spoken Language Processing

Niloy Mukherjee, Deb Roy

Massachusetts Institute of Technology, USA

Recent psycholinguistic experiments show that acoustic and syntactic aspects of online speech processing are influenced by visual context through cross-modal interactions. During interpretation of speech, visual context appears to steer speech processing, and vice versa. Motivated by these findings, we present a real-time multimodal system that performs early integration of visual contextual information to recognize the most likely word sequences in spoken language utterances. The system first acquires a grammar and a visually grounded lexicon through a "show-and-tell" procedure, in which the training input consists of camera images of sets of objects paired with verbal object descriptions. Given a new scene, the system generates a dynamic, visually grounded language model and drives a dynamic model of visual attention to steer speech recognition search paths toward more likely word sequences.
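The idea of a dynamic visually grounded language model can be sketched minimally as follows. This is an illustrative assumption, not the authors' implementation: it reweights a toy unigram language model so that words grounded in currently visible objects receive boosted probability before renormalization; the object names, the `boost` factor, and the function name are all hypothetical.

```python
# Minimal sketch (illustrative, not the paper's actual system):
# bias a unigram language model toward words grounded in the
# objects currently attended to in the visual scene.

def visually_grounded_lm(base_lm, scene_objects, boost=5.0):
    """Multiply the probability of visually grounded words by
    `boost`, then renormalize so the distribution sums to 1."""
    scores = {
        word: prob * (boost if word in scene_objects else 1.0)
        for word, prob in base_lm.items()
    }
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}

# Hypothetical vocabulary and scene; a real recognizer would
# rescore search paths with these adjusted probabilities.
base_lm = {"red": 0.2, "ball": 0.2, "cup": 0.2, "blue": 0.2, "the": 0.2}
scene = {"red", "ball"}  # stand-in for visual-attention output
lm = visually_grounded_lm(base_lm, scene)
```

Words describing visible objects ("red", "ball") now outscore equally frequent but visually ungrounded alternatives ("cup", "blue"), which is the sense in which visual context can steer the recognizer's search toward more likely word sequences.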

Full Paper

Bibliographic reference. Mukherjee, Niloy / Roy, Deb (2003): "A visual context-aware multimodal system for spoken language processing", in EUROSPEECH-2003, 2273-2276.