In this paper, we propose an integrated approach for leveraging available spoken content when detecting events in consumer-generated multimedia data (e.g., YouTube videos). Spoken content in consumer videos poses several challenges. For example, unlike Broadcast News, the spoken audio is typically not labeled. Also, the audio track in consumer videos tends to be noisy and the spoken content is often
Here, we describe three recent improvements that are specifically targeted at overcoming the challenges in consumer videos: robust data-driven keyword selection, automatic discovery of word-classes for keyword expansion, and a keyword spotting approach for improving recall in noisy conditions. These improvements were integrated into the audio analysis component of the BBN VISER system that demonstrated top performance in the 2011 TRECVID Multimedia Event Detection (MED) task. Experimental results on the 2011 TRECVID MED task clearly demonstrate the effectiveness of the three improvements.
Index Terms: multimedia event detection, keyword selection, keyword expansion, keyword spotting.
Bibliographic reference. Tsakalidis, Stavros / Zhuang, Xiaodan / Hsiao, Roger / Wu, Shuang / Natarajan, Pradeep / Prasad, Rohit / Natarajan, Prem (2012): "Robust event detection from spoken content in consumer domain videos", In INTERSPEECH-2012, 2101-2104.