How do we understand and interpret complex auditory environments in a way that may depend on stated goals or intentions? Here, we propose a framework that provides a detailed analysis of the spectrotemporal modulations in the acoustic signal, augmented with a discriminative classifier based on multilayer perceptrons. We show that such a representation successfully captures the non-trivial commonalities within a sound class and the differences between classes. It not only surpasses the performance of current systems in the literature by about 21%, but also proves robust for processing multi-source cases. In addition, we test the role of feature re-weighting in improving feature selectivity and the signal-to-noise ratio in the direction of a sound class of interest.
Index Terms: scene understanding, acoustic event recognition, attention, bottom-up, top-down
Bibliographic reference. Patil, Kailash / Elhilali, Mounya (2012): "Goal-oriented auditory scene recognition", In INTERSPEECH-2012, 2510-2513.
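The pipeline outlined in the abstract — modulation-based features feeding a discriminative multilayer perceptron — can be sketched in a minimal, numpy-only form. This is an illustration, not the authors' implementation: the 2-D FFT of a log-spectrogram here is a crude stand-in for a full cortical spectrotemporal modulation analysis, and the synthetic "sound classes", network size, and hyperparameters are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectrotemporal_features(spectrogram):
    """Crude stand-in for spectrotemporal modulation analysis:
    magnitude of the 2-D Fourier transform of a log-spectrogram,
    capturing joint spectral (scale) and temporal (rate) modulation energy."""
    mod = np.abs(np.fft.fft2(np.log1p(spectrogram)))
    return mod.ravel() / (np.linalg.norm(mod) + 1e-9)

def train_mlp(X, y, hidden=16, lr=0.5, epochs=300):
    """Tiny one-hidden-layer perceptron trained with gradient descent
    on a binary cross-entropy loss (illustrative, not the paper's setup)."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, hidden); b2 = 0.0
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)                   # hidden layer
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output
        g = (p - y) / n                            # dL/dlogit for cross-entropy
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum()
        gh = np.outer(g, W2) * (1.0 - h**2)        # backprop through tanh
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    return W1, b1, W2, b2

def predict(params, X):
    W1, b1, W2, b2 = params
    h = np.tanh(X @ W1 + b1)
    return (1.0 / (1.0 + np.exp(-(h @ W2 + b2))) > 0.5).astype(int)

def make_spec(rate):
    """Synthetic spectrogram dominated by a given temporal modulation rate."""
    t = np.linspace(0, 1, 32)
    envelope = 1.0 + np.sin(2 * np.pi * rate * t)
    return np.outer(np.hanning(16) + 0.1, envelope) + 0.05 * rng.random((16, 32))

# Two toy "sound classes": slow (2 Hz) vs. fast (8 Hz) temporal modulations.
X = np.array([spectrotemporal_features(make_spec(r))
              for r in [2] * 20 + [8] * 20])
y = np.array([0] * 20 + [1] * 20)
params = train_mlp(X, y)
acc = (predict(params, X) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Because the two classes differ cleanly in their modulation spectra, even this toy classifier separates them; the paper's contribution lies in the richer modulation representation and its robustness in multi-source conditions, which this sketch does not attempt to reproduce.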