Enabling smart devices to infer information about their environment from audio signals has been one of the long-standing challenges in machine listening. The availability of public-domain datasets, e.g., Detection and Classification of Acoustic Scenes and Events (DCASE) 2016, has enabled researchers to compare various algorithms on standard predefined tasks. Most of the current best-performing individual acoustic scene classification systems utilize different spectrogram image-based features with a Convolutional Neural Network (CNN) architecture. In this study, we first analyze the performance of a state-of-the-art CNN system for different auditory image and spectrogram features, including Mel-scaled, logarithmically scaled, and linearly scaled filterbank spectrograms, as well as Stabilized Auditory Image (SAI) features. Next, we benchmark an MFCC-based Gaussian Mixture Model (GMM) SuperVector (SV) system for acoustic scene classification. Finally, we utilize the activations from the final layer of the CNN to form a SuperVector (SV) and use it as a feature vector for a Probabilistic Linear Discriminant Analysis (PLDA) classifier. Experimental evaluation on the DCASE 2016 database demonstrates the effectiveness of the proposed CNN-SV approach compared to conventional CNNs with a fully connected softmax output layer. Score fusion of individual systems provides up to 7% relative improvement in overall accuracy compared to the CNN baseline system.
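To make the SuperVector idea concrete, the following is a minimal sketch of an MFCC-GMM SuperVector extractor of the kind benchmarked above: a universal background GMM is fitted on pooled training frames, its means are MAP-adapted to each recording, and the adapted means are stacked into a single fixed-length vector (the paper's CNN-SV system follows the same stacking principle, but with final-layer CNN activations). The feature settings, number of mixture components, relevance factor, and use of librosa/scikit-learn are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative MFCC-GMM SuperVector sketch (assumed parameters, not the paper's setup).
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Frame-level MFCCs, returned with shape (n_frames, n_mfcc)."""
    y, sr = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_ubm(pooled_training_frames, n_components=64):
    """Universal background model fitted on frames pooled over all training scenes."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(pooled_training_frames)
    return ubm

def gmm_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one recording and stack them into a supervector."""
    post = ubm.predict_proba(frames)                # (n_frames, n_components) posteriors
    n_k = post.sum(axis=0)                          # zeroth-order (soft count) statistics
    f_k = post.T @ frames                           # first-order statistics per component
    alpha = (n_k / (n_k + relevance))[:, None]      # data-dependent adaptation weights
    adapted = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                          # fixed-length supervector of stacked means
```

In this sketch each recording, regardless of duration, is mapped to one vector of size `n_components * n_mfcc`, which can then be fed to a back-end classifier such as PLDA, as in the systems described above.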
Cite as: Hyder, R., Ghaffarzadegan, S., Feng, Z., Hansen, J.H.L., Hasan, T. (2017) Acoustic Scene Classification Using a CNN-SuperVector System Trained with Auditory and Spectrogram Image Features. Proc. Interspeech 2017, 3073-3077, doi: 10.21437/Interspeech.2017-431
@inproceedings{hyder17_interspeech,
  author={Rakib Hyder and Shabnam Ghaffarzadegan and Zhe Feng and John H.L. Hansen and Taufiq Hasan},
  title={{Acoustic Scene Classification Using a CNN-SuperVector System Trained with Auditory and Spectrogram Image Features}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3073--3077},
  doi={10.21437/Interspeech.2017-431}
}