7th International Conference on Spoken Language Processing
September 16-20, 2002
We present a trainable, visually-grounded, spoken language understanding system. The system acquires a grammar and vocabulary from a "show-and-tell" procedure in which visual scenes are paired with verbal descriptions. The system is embodied in a table-top mounted active vision platform. During training, a set of objects is placed in front of the vision system. Using a laser pointer, the system points to objects in random sequence, prompting a human teacher to provide spoken descriptions of the selected objects. The descriptions are transcribed and used to automatically acquire a visually-grounded vocabulary and grammar. Once the system is trained, a person can interact with it by verbally describing objects placed in front of it. The system recognizes and robustly parses the speech and points, in real time, to the object that best fits the visual semantics of the spoken description.
Bibliographic reference. Roy, Deb / Gorniak, Peter / Mukherjee, Niloy / Juster, Josh (2002): "A trainable spoken language understanding system for visual object selection", In ICSLP-2002, 593-596.