The ever-increasing volume of consumer domain videos on the Internet has led to a surge of interest in automatically analyzing such content. The audio signal in these videos contains salient information, but applying current automatic speech recognition (ASR) techniques is not viable due to high variability, noise and multilingual content. We present two unsupervised techniques which do not rely on ASR to address these challenges. The first method learns an unsupervised codebook by clustering audio features, and the second directly matches low-level features using the pyramid match kernel (PMK). Experimental results on a ∼200 hour audio corpus downloaded from YouTube show that both our approaches significantly outperform the traditional approach of first segmenting the audio stream into a set of mid-level classes (e.g. speech, non-speech, music, silence) and using the duration statistics of these classes to train high-level classifiers.
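The codebook approach described above can be illustrated with a minimal sketch: cluster low-level audio feature vectors (e.g. per-frame MFCCs) with k-means to form codewords, then represent each video as a normalized histogram of codeword assignments that a downstream classifier can consume. This is an illustrative reconstruction, not the authors' implementation; the function names, the toy feature dimensionality, and the use of plain k-means are assumptions.

```python
# Illustrative sketch (not the paper's code): unsupervised audio codebook
# via k-means over frame-level feature vectors, followed by a
# bag-of-codewords histogram per video.
import random


def _nearest(point, centers):
    """Index of the center closest to `point` (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: returns `k` cluster centers (the codebook)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[_nearest(p, centers)].append(p)
        for c in range(k):
            if clusters[c]:  # keep old center if the cluster emptied out
                centers[c] = tuple(sum(p[d] for p in clusters[c]) / len(clusters[c])
                                   for d in range(dim))
    return centers


def codeword_histogram(frames, centers):
    """Normalized histogram of codeword assignments for one video's frames."""
    h = [0.0] * len(centers)
    for p in frames:
        h[_nearest(p, centers)] += 1.0
    total = sum(h) or 1.0
    return [v / total for v in h]


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy stand-in for frame-level audio features from a collection of videos:
    # two well-separated Gaussian blobs in 4 dimensions.
    frames = [tuple(rng.gauss(m, 0.1) for _ in range(4))
              for m in (0, 0, 5, 5, 0, 5) * 10]
    codebook = kmeans(frames, k=2)
    print(codeword_histogram(frames, codebook))
```

A real system would extract MFCC or similar frames from the audio, learn a much larger codebook over the training set, and feed the histograms to a high-level classifier such as an SVM.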
Bibliographic reference. Natarajan, Pradeep / Tsakalidis, Stavros / Manohar, Vasant / Prasad, Rohit / Natarajan, Premkumar (2011): "Unsupervised audio analysis for categorizing heterogeneous consumer domain videos", In INTERSPEECH-2011, 313-316.