The detection of the acoustic events (AEs) that naturally occur in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, the detection of AEs from audio information alone yields a large number of errors, mostly due to temporal overlapping of sounds. In this paper, a system to detect and recognize AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the structure of the HMM-GMM based system considers each class separately and uses a one-against-all strategy for training. Experimental acoustic event detection (AED) results on a new and rather spontaneous dataset are presented, which show the advantage of the proposed approach.
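As a rough illustration of the feature-level fusion strategy mentioned above, the sketch below simply concatenates per-frame audio and video feature vectors into a single observation vector before modeling. The feature dimensions, the helper name `fuse_features`, and the example sizes are assumptions for illustration; the paper's actual feature sets are not specified here.

```python
import numpy as np

def fuse_features(audio_feats, video_feats):
    """Feature-level fusion: concatenate the audio and video feature
    vectors of each frame into one joint observation vector.
    Both inputs are (n_frames, n_dims) arrays with equal frame counts.
    (Hypothetical helper; dimensions below are illustrative only.)"""
    if audio_feats.shape[0] != video_feats.shape[0]:
        raise ValueError("audio and video streams must be frame-synchronous")
    return np.concatenate([audio_feats, video_feats], axis=1)

# Illustrative sizes: 100 frames of 13-dim audio features
# fused with 4-dim video-derived features per frame.
audio = np.random.randn(100, 13)
video = np.random.randn(100, 4)
fused = fuse_features(audio, video)
print(fused.shape)  # (100, 17)
```

The fused vectors would then be modeled by one HMM-GMM per AE class, each trained one-against-all (target class vs. everything else), as the abstract describes.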
Bibliographic reference. Butko, T. / Canton-Ferrer, C. / Segura, C. / Giró, X. / Nadeu, C. / Hernando, J. / Casas, J. R. (2009): "Improving detection of acoustic events using audiovisual data and feature level fusion", In INTERSPEECH-2009, 1147-1150.