10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion

T. Butko, C. Canton-Ferrer, C. Segura, X. Giró, C. Nadeu, J. Hernando, J. R. Casas

Universitat Politècnica de Catalunya, Spain

Detecting the acoustic events (AEs) that are naturally produced in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, acoustic event detection (AED) based on audio information alone produces a large number of errors, most of which are due to temporally overlapping sounds. In this paper, a system that detects and recognizes AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the structure of the HMM-GMM based system treats each class separately, using a one-against-all strategy for training. Experimental AED results on a new and rather spontaneous dataset show the advantage of the proposed approach.
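The abstract names two ingredients: feature-level fusion of audio and video, and per-class models trained one-against-all. The following is a minimal illustrative sketch of both ideas, not the authors' system. It assumes per-frame audio and video feature vectors that span the same time interval, pairs each audio frame with the video frame at the same relative position, and replaces the paper's HMM-GMM models with a single diagonal Gaussian per model for brevity. The function names, class labels, toy data and the zero decision threshold are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Model = Tuple[Vector, Vector]  # (mean, variance) of a diagonal Gaussian


def fuse_features(audio: Sequence[Vector], video: Sequence[Vector]) -> List[Vector]:
    """Feature-level fusion: concatenate each audio frame with the video frame
    at the same relative position (assumes both streams cover the same interval)."""
    if not audio or not video:
        raise ValueError("both streams must be non-empty")
    ratio = len(video) / len(audio)
    fused = []
    for i, a in enumerate(audio):
        j = min(int(i * ratio), len(video) - 1)  # video index for audio frame i
        fused.append(list(a) + list(video[j]))
    return fused


def fit_diag_gaussian(frames: Sequence[Vector], floor: float = 1e-3) -> Model:
    """Maximum-likelihood diagonal Gaussian; variances are floored for stability."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, floor) for d in range(dim)]
    return mean, var


def log_likelihood(frames: Sequence[Vector], model: Model) -> float:
    """Total log-likelihood of all frames under a diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def train_one_against_all(data: Dict[str, List[Vector]]) -> Dict[str, Tuple[Model, Model]]:
    """For each class, fit a model on its own frames and a 'rest' model on
    the frames of all other classes (requires at least two classes)."""
    models = {}
    for label, frames in data.items():
        rest = [f for other, fr in data.items() if other != label for f in fr]
        models[label] = (fit_diag_gaussian(frames), fit_diag_gaussian(rest))
    return models


def detect(segment: Sequence[Vector], models: Dict[str, Tuple[Model, Model]],
           threshold: float = 0.0) -> List[str]:
    """Run every per-class detector independently and return the labels whose
    per-frame average log-likelihood ratio (class vs. rest) exceeds threshold."""
    detected = []
    for label, (pos, neg) in models.items():
        llr = (log_likelihood(segment, pos) - log_likelihood(segment, neg)) / len(segment)
        if llr > threshold:
            detected.append(label)
    return detected


if __name__ == "__main__":
    # Toy data: 2-D audio features per frame, 1-D video features at half the frame rate.
    knock = fuse_features([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [1.0, 0.05]], [[5.0], [5.2]])
    applause = fuse_features([[-1.0, 2.0], [-1.1, 2.1], [-0.9, 1.9], [-1.0, 2.05]], [[0.0], [0.2]])
    models = train_one_against_all({"door_knock": knock, "applause": applause})
    print(detect(knock, models))  # prints ['door_knock']
```

Because each class has its own independent detector, a design like this can flag more than one class for the same segment, which is one reason a one-against-all structure is attractive when sounds may overlap in time. A real system would replace the single Gaussians with trained HMM-GMM models and would need a proper time alignment between the audio and video streams.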


Bibliographic reference.  Butko, T. / Canton-Ferrer, C. / Segura, C. / Giró, X. / Nadeu, C. / Hernando, J. / Casas, J. R. (2009): "Improving detection of acoustic events using audiovisual data and feature level fusion", In INTERSPEECH-2009, 1147-1150.