We present an audio-visual attention system for speech-based interaction with a humanoid robot, in which a tutor can teach visual properties/locations (e.g., “left”) and corresponding, arbitrary speech labels. The acoustic signal is segmented via the attention system, and speech labels are learned from a few repetitions of the label by the tutor. The attention system integrates bottom-up, stimulus-driven saliency calculation (delay-and-sum beamforming, adaptive noise level estimation) and top-down modulation (spectral properties, segment length, movement and interaction status of the robot). We evaluate the performance of different aspects of the system on a small dataset.
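As an illustration of the delay-and-sum beamforming named above, the following is a minimal sketch and not the authors' implementation. It assumes integer per-microphone delays in samples that are already known for a candidate source direction. The function names, the three-microphone toy signal, and the delay values are all hypothetical.

```python
import math


def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay (in samples) and average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   assumed-known integer arrival delay per microphone.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # read ahead by d samples to undo the arrival delay
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]


def energy(signal):
    """Sum of squared samples, used here as a crude saliency proxy."""
    return sum(v * v for v in signal)


def delayed(signal, d):
    """Return the signal delayed by d samples, zero-padded at the start."""
    return [signal[t - d] if 0 <= t - d < len(signal) else 0.0
            for t in range(len(signal))]


# Toy data (hypothetical): one sine source reaching three microphones
# with arrival delays of 0, 3, and 6 samples.
n = 64
source = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
true_delays = [0, 3, 6]
mics = [delayed(source, d) for d in true_delays]

steered = delay_and_sum(mics, true_delays)   # steered at the source
unsteered = delay_and_sum(mics, [0, 0, 0])   # no delay compensation
```

When the compensating delays match the true arrival delays, the channels add coherently and the output energy is higher than for the uncompensated sum. This difference is what makes the beamformer output usable as a direction-dependent saliency cue.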
Bibliographic reference. Heckmann, Martin / Brandl, Holger / Domont, Xavier / Bolder, Bram / Joublin, Frank / Goerick, Christian (2009): "An audio-visual attention system for online association learning", In INTERSPEECH-2009, 2171-2174.