ISCA Archive Interspeech 2009

An audio-visual attention system for online association learning

Martin Heckmann, Holger Brandl, Xavier Domont, Bram Bolder, Frank Joublin, Christian Goerick

We present an audio-visual attention system for speech-based interaction with a humanoid robot, where a tutor can teach visual properties/locations (e.g. "left") and corresponding, arbitrary speech labels. The acoustic signal is segmented via the attention system, and speech labels are learned from a few repetitions of the label by the tutor. The attention system integrates bottom-up, stimulus-driven saliency calculation (delay-and-sum beamforming, adaptive noise level estimation) and top-down modulation (spectral properties, segment length, movement and interaction status of the robot). We evaluate the performance of different aspects of the system on a small dataset.
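The abstract names delay-and-sum beamforming as one of the bottom-up saliency cues. As a rough illustration of that technique (not the paper's implementation — the function name, array shapes, and integer-sample delays here are assumptions for this sketch), each microphone channel is shifted to compensate for its propagation delay and the aligned channels are averaged, reinforcing the signal from the steered direction:

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Minimal delay-and-sum beamformer sketch.

    signals:        array of shape (n_mics, n_samples), one row per microphone.
    delays_samples: per-microphone integer delays (in samples) toward the
                    steered direction; each channel is advanced by its delay
                    so all channels line up before averaging.
    Uses a circular shift for simplicity; a real system would use fractional,
    non-circular delays.
    """
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, -int(d))  # advance this channel by d samples
    return out / n_mics
```

With delays matching the true inter-channel lags, the aligned channels add coherently, while sources from other directions (and uncorrelated noise) add incoherently and are attenuated.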

doi: 10.21437/Interspeech.2009-619

Cite as: Heckmann, M., Brandl, H., Domont, X., Bolder, B., Joublin, F., Goerick, C. (2009) An audio-visual attention system for online association learning. Proc. Interspeech 2009, 2171-2174, doi: 10.21437/Interspeech.2009-619

@inproceedings{heckmann09_interspeech,
  author={Martin Heckmann and Holger Brandl and Xavier Domont and Bram Bolder and Frank Joublin and Christian Goerick},
  title={{An audio-visual attention system for online association learning}},
  year={2009},
  booktitle={Proc. Interspeech 2009},
  pages={2171--2174},
  doi={10.21437/Interspeech.2009-619}
}