High-level multimedia event detection aims to identify videos containing a target event. Recent approaches leveraging audio information for this task fall into two broad categories. The first comprises holistic bag-of-words approaches based on frame-level descriptors. These are effective for classification, but hard for humans to interpret. The second builds a limited set of mid-level concept detectors trained on large amounts of annotated data. Such approaches do not scale easily to large tasks with heterogeneous data. We explore using audio Self Organized Units (SOUs) to capture mid-level segmental information in a completely unsupervised fashion, and devise various features based on the SOU decoding of each video. We train BBN's speech SOU system on unannotated web audio data. A multi-pass adaptive decoder from the BBN speech recognition system is used to decode the audio data with the HMM-based audio SOUs. We derive various vector representations from the audio SOU lattices and from the constrained maximum likelihood linear regression adaptation matrices at different stages of decoding. High-level event detection using these representations shows promising results on the benchmark 2011 TRECVID Multimedia Event Detection dataset. Furthermore, the audio SOUs offer the potential for human-interpretable features.
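As a minimal illustration of the kind of vector representation described above, one could build a normalized term-frequency (bag-of-units) vector over the SOU inventory from a video's decoded unit sequence. The function name, unit labels, and example sequence below are hypothetical, not from the paper:

```python
from collections import Counter

def sou_bag_of_units(decoded_units, inventory):
    """Build a normalized term-frequency vector over a fixed SOU inventory.

    decoded_units: list of SOU labels decoded from one video's audio track.
    inventory: ordered list of all SOU labels (the unit vocabulary).
    """
    counts = Counter(decoded_units)
    # Guard against an empty decoding to avoid division by zero.
    total = float(len(decoded_units)) or 1.0
    return [counts[u] / total for u in inventory]

# Hypothetical 4-unit inventory and one video's decoded sequence.
inventory = ["u1", "u2", "u3", "u4"]
decoded = ["u2", "u2", "u1", "u3", "u2"]
vec = sou_bag_of_units(decoded, inventory)
# vec -> [0.2, 0.6, 0.2, 0.0]
```

Richer variants in the same spirit could weight counts by lattice posteriors or append statistics from the CMLLR adaptation matrices, as the abstract suggests.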
Bibliographic reference. Zhuang, Xiaodan / Wu, Shuang / Natarajan, Pradeep / Prasad, Rohit / Natarajan, Prem (2013): "Audio self organized units for high-level event detection", In INTERSPEECH-2013, 2953-2957.