An audio-visual localisation and tracking system for meeting scenarios is presented which draws its inspiration from neurobiological processing. Meetings are recorded by a KEMAR binaural manikin and a single camera placed directly above the manikin. Source localisation estimates from the binaural audio, together with face, object and motion locations from the video frames, are used as input to two linked neural oscillator networks. The strength of the connections between the two networks determines the mapping between activity at a particular audio azimuth and activity at a particular visual frame column; a Hebbian learning rule is used to establish these connection strengths. The combined network segments the audio and video features and then produces audio-visual groupings on the basis of common spatial location. The audio-visual groupings are tracked through time using a mechanism based upon that of the human oculomotor system, incorporating both smooth pursuit and saccadic movements.
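The cross-modal mapping learned by the Hebbian rule can be illustrated with a minimal sketch. The sizes, learning rate, normalisation scheme and activity vectors below are illustrative assumptions, not the authors' implementation: weights between audio-azimuth units and video-column units are strengthened whenever the two are simultaneously active, so a talker whose voice and face repeatedly co-occur at the same spatial location gradually acquires a strong azimuth-to-column connection.

```python
import numpy as np

N_AZIMUTHS = 37   # e.g. -90..+90 degrees in 5-degree steps (assumed)
N_COLUMNS = 320   # video frame width in pixels (assumed)

rng = np.random.default_rng(0)
# Audio-to-visual connection weights, initialised to small random values.
W = rng.uniform(0.0, 0.01, size=(N_AZIMUTHS, N_COLUMNS))

def hebbian_update(W, audio_activity, visual_activity, eta=0.05):
    """Strengthen connections between co-active audio and visual units.

    audio_activity:  (N_AZIMUTHS,) activity per audio azimuth channel
    visual_activity: (N_COLUMNS,)  activity per visual frame column
    """
    # Outer-product Hebbian term: co-active unit pairs gain weight.
    W = W + eta * np.outer(audio_activity, visual_activity)
    # Row-wise normalisation (an assumption) keeps each azimuth's
    # outgoing weights bounded so they compete rather than grow freely.
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1e-9)
    return W

# Toy example: one active azimuth channel and a face spanning nearby columns.
audio = np.zeros(N_AZIMUTHS); audio[18] = 1.0
visual = np.zeros(N_COLUMNS); visual[150:170] = 1.0
for _ in range(100):
    W = hebbian_update(W, audio, visual)
print(W[18, 150:170].mean(), W[18, :50].mean())  # learned mapping vs. background
```

In the full system, the activities driving the update would come from the two oscillator networks themselves rather than the binary toy vectors used here.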
Cite as: Wrigley, S.N., Brown, G.J. (2005) Physiologically motivated audio-visual localisation and tracking. Proc. Interspeech 2005, 773-776, doi: 10.21437/Interspeech.2005-360
@inproceedings{wrigley05_interspeech,
  author={Stuart N. Wrigley and Guy J. Brown},
  title={{Physiologically motivated audio-visual localisation and tracking}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={773--776},
  doi={10.21437/Interspeech.2005-360}
}