5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Learning Words from Natural Audio-Visual Input

Deb Roy, Alex Pentland

MIT Media Laboratory, USA

We present a computational model of sensory-grounded language acquisition. Words are learned from naturally spoken multiword utterances paired with color images of objects. Speech recognition and computer vision algorithms build representations of the input speech and images. Learning proceeds in two stages: images are first clustered along shape and color dimensions, and a search algorithm then finds segments of the continuous multiword speech that co-occur with each visual cluster. The learned words can be used in a speech understanding task, to request images from spoken descriptions, and in a speech generation task, to automatically produce spoken descriptions of images. Although simple in its current form, this model is a first step towards a more complete, fully grounded model of language acquisition. Practical applications include adaptive spoken-language human-machine interfaces for information browsing, assistive technologies, education, and entertainment.
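The co-occurrence search described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the speech representation, the search procedure, or the scoring function. The sketch assumes that each utterance has already been reduced to a sequence of phoneme-like symbols, that each image has already been assigned a visual cluster label, and that the association between a segment and a cluster is scored by mutual information. The names `segments`, `mutual_information`, and `learn_words`, and the toy data, are hypothetical.

```python
import math


def segments(symbols, min_len=2, max_len=4):
    """Return the set of contiguous sub-sequences of one utterance."""
    out = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(symbols) - n + 1):
            out.add(tuple(symbols[i:i + n]))
    return out


def mutual_information(n11, n10, n01, n00):
    """Mutual information (bits) of two binary events from a 2x2 count table.

    n11: segment present, image in cluster    n10: present, not in cluster
    n01: segment absent,  image in cluster    n00: absent,  not in cluster
    """
    total = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, row, col in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if joint > 0:
            mi += (joint / total) * math.log2(joint * total / (row * col))
    return mi


def learn_words(pairs, min_len=2, max_len=4):
    """Pick, for each visual cluster, the segment that best co-occurs with it.

    pairs: list of (cluster_label, symbol_list).
    Returns {cluster_label: (segment, score)}.
    Assumption: scoring by mutual information; ties go to the longer segment.
    """
    seg_sets = [(label, segments(sym, min_len, max_len)) for label, sym in pairs]
    all_segments = set().union(*(s for _, s in seg_sets))
    best = {}
    for label in {label for label, _ in seg_sets}:
        candidates = []
        for seg in all_segments:
            n11 = sum(1 for l, s in seg_sets if l == label and seg in s)
            n10 = sum(1 for l, s in seg_sets if l != label and seg in s)
            n01 = sum(1 for l, s in seg_sets if l == label and seg not in s)
            n00 = len(seg_sets) - n11 - n10 - n01
            # Keep only segments that are positively associated with the cluster.
            if n11 * n00 > n10 * n01:
                score = mutual_information(n11, n10, n01, n00)
                candidates.append((score, len(seg), seg))
        if candidates:
            score, _, seg = max(candidates)
            best[label] = (seg, score)
    return best


# Toy data (hypothetical): cluster label plus a phoneme-like transcription.
pairs = [
    ("ball", ["l", "uh", "k", "b", "ao", "l"]),
    ("ball", ["b", "ao", "l", "hh", "iy", "r"]),
    ("ball", ["dh", "ax", "b", "ao", "l"]),
    ("cup", ["l", "uh", "k", "k", "ah", "p"]),
    ("cup", ["k", "ah", "p", "hh", "iy", "r"]),
    ("cup", ["dh", "ax", "k", "ah", "p"]),
]

learned = learn_words(pairs)
for label in sorted(learned):
    segment, score = learned[label]
    print(label, " ".join(segment), round(score, 3))
```

On this toy data the highest-scoring segment for each cluster is the three-symbol sequence shared by all of that cluster's utterances, while carrier phrases that appear with both clusters score lower. Real recognizer output is noisy, so exact symbol matching, as used here, would need to be replaced by an approximate acoustic or phonetic distance.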


Bibliographic reference. Roy, Deb / Pentland, Alex (1998): "Learning words from natural audio-visual input", in ICSLP-1998, paper 0551.