7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

A Trainable Spoken Language Understanding System for Visual Object Selection

Deb Roy, Peter Gorniak, Niloy Mukherjee, Josh Juster

Massachusetts Institute of Technology, USA

We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually-grounded vocabulary and grammar. Once the system is trained, a person can interact with it by verbally describing objects placed in front of it. The system recognizes and robustly parses the speech and points, in real time, to the object that best fits the visual semantics of the spoken description.
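
The show-and-tell idea can be illustrated with a toy sketch. This is not the authors' method: the abstract does not specify the visual features, the learning algorithm, or the grammar, so everything below is an assumption made for illustration. The sketch assumes each object is a short numeric feature vector (hypothetically, redness and size), assumes descriptions arrive already transcribed as text, and grounds each word as an independent per-feature Gaussian. The names `train`, `log_score`, and `select` are hypothetical.

```python
import math
from collections import defaultdict


def train(pairs):
    """Learn a per-word Gaussian over visual features.

    pairs: list of (description, feature_vector) tuples, standing in for
    the transcribed descriptions paired with visual scenes.
    Returns {word: (means, variances)} with one entry per feature dimension.
    """
    samples = defaultdict(list)
    for text, feats in pairs:
        for word in text.lower().split():
            samples[word].append(feats)

    model = {}
    for word, rows in samples.items():
        n = len(rows)
        dims = len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dims)]
        # Floor the variance so a word seen once does not get zero variance.
        variances = [
            max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-2)
            for d in range(dims)
        ]
        model[word] = (means, variances)
    return model


def log_score(model, text, feats):
    """Sum the Gaussian log-likelihoods of an object under each known word."""
    total = 0.0
    for word in text.lower().split():
        if word not in model:
            continue  # unknown words are ignored in this sketch
        means, variances = model[word]
        for x, m, v in zip(feats, means, variances):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def select(model, text, scene):
    """Return the index of the object in the scene that best fits the description."""
    return max(range(len(scene)), key=lambda i: log_score(model, text, scene[i]))
```

Given a few training pairs such as `("red ball", [0.9, 0.2])` and `("green ball", [0.1, 0.3])`, calling `select(model, "red", scene)` returns the index of the scene object whose features score highest under the learned word models. A word that co-occurs with many different feature values ends up with a broad Gaussian and therefore contributes little to the choice, which is one simple way a word can be treated as visually uninformative.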


Bibliographic reference.  Roy, Deb / Gorniak, Peter / Mukherjee, Niloy / Juster, Josh (2002): "A trainable spoken language understanding system for visual object selection", In ICSLP-2002, 593-596.