Non-speech acoustic events are significantly different between them, and usually require access to detail rich features. That is why directly modeling a real spectrogram can provide a significant advantage, instead of using predefined features that usually compress and downsample detail as typically done in speech recognition. This paper focuses on the importance of feature extraction for deep learning based acoustic event detection, and more specifically on exploiting local spectro-temporal features of sounds. We do this in two ways: (1) outside the model, using multiple resolution spectrogram simultaneously based on the fact that there is a time-frequency detail trade-off that depends on the resolution with which a spectrogram is computed (e.g. `steps' would require a finer time resolution, while sounds that span many frequencies require finer frequency detail); and (2), with a model that implicitly exploits locality, convolutional neural networks, which are a state-of-the-art 2D feature extraction model. An experimental evaluation shows that the presented approaches outperform state-of-the-art deep learning baseline with a noticeable gain in the CNN case, and provides insights regarding CNN-based spectrogram characterization.
Bibliographic reference. Espi, Miquel / Fujimoto, Masakiyo / Kinoshita, Keisuke / Nakatani, Tomohiro (2015): "Feature extraction strategies in deep learning based acoustic event detection", In INTERSPEECH-2015, 2922-2926.