In this paper, we attempt to represent audio as a sequence of acoustic units learned without supervision, and use this representation for multi-class classification. We expect the acoustic units to capture sounds or sound sequences, automatically forming a sound alphabet. We use audio from multi-class, YouTube-quality multimedia data to converge on a set of sound units, such that each audio file is represented as a sequence of these units. We then learn category-specific language models over sequences of the acoustic units, and use them to generate acoustic and language model scores for each category. Finally, we use a margin-based classification algorithm to weight the category scores and predict the class of each test data point. We compare different settings and report encouraging results on this task.
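The pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: it stands in k-means clustering for the unsupervised acoustic unit learning, uses add-one-smoothed bigram language models per category, and replaces the margin-based score weighting with a simple argmax over language model scores. All data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Toy k-means, a stand-in for the paper's unsupervised unit learning."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def encode(X, centers):
    """Map each feature frame to the index of its nearest unit (codeword)."""
    return np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)

def bigram_lm(seqs, k):
    """Add-one-smoothed bigram log-probabilities over unit sequences."""
    counts = np.ones((k, k))
    for s in seqs:
        for a, b in zip(s[:-1], s[1:]):
            counts[a, b] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def lm_score(seq, lm):
    """Log-likelihood of a unit sequence under a category language model."""
    return sum(lm[a, b] for a, b in zip(seq[:-1], seq[1:]))

# Synthetic "audio": 2-D feature frames drawn from class-dependent modes.
k = 4
train = {c: [rng.normal(m, 1.0, size=(30, 2)) for _ in range(5)]
         for c, m in {0: 0.0, 1: 10.0}.items()}

# Learn a shared unit inventory from all training frames, then build
# one bigram LM per category over the resulting unit sequences.
all_frames = np.vstack([f for fs in train.values() for f in fs])
centers = kmeans(all_frames, k)
lms = {c: bigram_lm([encode(f, centers) for f in fs], k)
       for c, fs in train.items()}

# Classify a held-out clip (drawn like class 1) by best LM score.
test_seq = encode(rng.normal(10.0, 1.0, size=(30, 2)), centers)
pred = max(lms, key=lambda c: lm_score(test_seq, lms[c]))
print(pred)
```

In the paper's full system, the per-category acoustic and language model scores would instead be combined with weights learned by a margin-based classifier rather than compared directly.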
Bibliographic reference. Chaudhuri, Sourish / Harvilla, Mark / Raj, Bhiksha (2011): "Unsupervised learning of acoustic unit descriptors for audio content representation and classification", in Proc. INTERSPEECH 2011, pp. 2265-2268.