9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Fusion of Audio and Video Modalities for Detection of Acoustic Events

Taras Butko, Andrey Temko, Climent Nadeu, Cristian Canton

Universitat Politècnica de Catalunya, Spain

Acoustic event detection (AED) in a meeting-room environment becomes a difficult task when the signals show a large proportion of temporally overlapping sounds, as in seminar-type data, where acoustic events often occur simultaneously with speech. Whenever the event that produces the sound is related to a given position or movement, video signals may be a useful additional source of information for AED. In this work, we aim to improve AED accuracy by using two complementary audio-based AED systems, built with SVM and HMM classifiers, together with a video-based AED system, which employs the output of a 3D video tracking algorithm to improve the detection of steps. A fuzzy integral is used to fuse the outputs of the three classification systems in two stages. Experimental results on the CLEAR'07 evaluation data show that the detection rate increases when the two audio information sources are fused, and that it improves further when video information is included.
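As an illustration of the fusion step, the following is a minimal sketch of a two-stage fuzzy-integral fusion. It assumes the Choquet form of the fuzzy integral, which is one common choice; the abstract does not state which form is used. The function name, the source labels, the fuzzy-measure values, the scores, and the order of the two stages (audio systems first, video second) are all hypothetical and are not taken from the paper.

```python
def choquet_integral(scores, measure):
    """Fuse per-source scores with a Choquet fuzzy integral (assumed form).

    scores:  dict mapping source name -> score in [0, 1]
    measure: dict mapping frozenset of source names -> fuzzy measure value,
             monotone under set inclusion, equal to 1.0 for the full set
    """
    # Visit sources from the highest score to the lowest.
    ordered = sorted(scores, key=scores.get, reverse=True)
    fused = 0.0
    prev_mu = 0.0
    subset = set()
    for name in ordered:
        subset.add(name)
        mu = measure[frozenset(subset)]
        # Each score is weighted by the measure gained when its source
        # joins the set of higher-scoring sources.
        fused += scores[name] * (mu - prev_mu)
        prev_mu = mu
    return fused


# Stage 1 (hypothetical values): fuse the two audio-based systems.
audio_measure = {
    frozenset({"svm"}): 0.6,
    frozenset({"hmm"}): 0.5,
    frozenset({"svm", "hmm"}): 1.0,
}
audio_score = choquet_integral({"svm": 0.8, "hmm": 0.4}, audio_measure)

# Stage 2 (hypothetical values): fuse the audio result with the video system.
av_measure = {
    frozenset({"audio"}): 0.7,
    frozenset({"video"}): 0.4,
    frozenset({"audio", "video"}): 1.0,
}
final_score = choquet_integral({"audio": audio_score, "video": 0.9}, av_measure)

print(audio_score, final_score)
```

Unlike a weighted average, the measure assigns an importance to every subset of sources rather than to each source alone, so the fusion can reflect how sources reinforce or duplicate one another. The fused score always lies between the lowest and the highest input score.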

