We present a computational model of sensory-grounded language acquisition. Words are learned from naturally spoken multiword utterances paired with color images of objects. Speech recognition and computer vision algorithms are used to build representations of the input speech and images. Learning proceeds by first clustering images along shape and color dimensions; a search algorithm then finds speech segments within the continuous multiword utterances that co-occur with each visual cluster. The learned words can be used in a speech understanding task to request images based on spoken descriptions, and in a speech generation task to produce spoken descriptions of images. Although simple in its current form, this model is a first step towards a more complete, fully grounded model of language acquisition. Practical applications include adaptive spoken-language human-machine interfaces for information browsing, assistive technologies, education, and entertainment.
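The abstract does not specify how the clustering or the co-occurrence search is carried out. The Python sketch below only illustrates the general two-stage idea under assumptions of our own, not the authors' method: image and speech-segment features are assumed to be precomputed, k-means stands in for the visual clustering, and a pointwise mutual information score stands in for the co-occurrence criterion. All names (cluster_images, best_segment_per_cluster, seg_A, etc.) are hypothetical.

import numpy as np
from sklearn.cluster import KMeans


def cluster_images(image_features, n_clusters=5, seed=0):
    # Group images by their (assumed precomputed) shape/color feature vectors.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(image_features)


def best_segment_per_cluster(cluster_ids, segment_occurrences):
    # For each visual cluster, pick the candidate speech segment whose
    # occurrences across utterance/image pairs co-occur most strongly with
    # that cluster, scored here by pointwise mutual information (our
    # assumption, not the criterion stated in the paper).
    #
    # segment_occurrences: dict mapping a segment label to a boolean array
    # with one entry per utterance/image pair (True if the segment occurred).
    best = {}
    for c in np.unique(cluster_ids):
        in_cluster = cluster_ids == c
        p_cluster = in_cluster.mean()
        scores = {}
        for seg, occurs in segment_occurrences.items():
            p_segment = occurs.mean()
            p_joint = (in_cluster & occurs).mean()
            if p_joint > 0:
                scores[seg] = np.log(p_joint / (p_cluster * p_segment))
        if scores:
            best[c] = max(scores, key=scores.get)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 60 image feature vectors drawn from three shape/color groups.
    feats = np.vstack([rng.normal(m, 0.3, size=(20, 4)) for m in (0.0, 2.0, 4.0)])
    labels = cluster_images(feats, n_clusters=3)
    # Toy speech-segment occurrence patterns, one boolean per utterance/image pair.
    idx = np.arange(60)
    segments = {
        "seg_A": idx < 20,
        "seg_B": (idx >= 20) & (idx < 40),
        "seg_C": idx >= 40,
    }
    print(best_segment_per_cluster(labels, segments))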
Cite as: Roy, D., Pentland, A. (1998) Learning words from natural audio-visual input. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0551, doi: 10.21437/ICSLP.1998-275
@inproceedings{roy98_icslp,
  author={Deb Roy and Alex Pentland},
  title={{Learning words from natural audio-visual input}},
  year={1998},
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 0551},
  doi={10.21437/ICSLP.1998-275}
}