A Robust Framework for Acoustic Scene Classification

Lam Pham, Ian McLoughlin, Huy Phan, Ramaswamy Palaniappan

Acoustic scene classification (ASC) using front-end time-frequency features and back-end neural network classifiers has demonstrated good performance in recent years. However, a profusion of systems has arisen to suit different tasks and datasets, utilising different feature and classifier types. This paper aims to develop a robust framework that can explore and utilise a range of different time-frequency features and neural networks, either singly or merged, to achieve good classification performance. In particular, we exploit three different types of front-end time-frequency features: log-energy Mel filter, Gammatone filter and constant Q transform. At the back-end we evaluate an effective two-stage model that exploits a Convolutional Neural Network for pre-trained feature extraction, followed by Deep Neural Network classifiers as a post-trained feature adaptation model and classifier. We also explore the use of a data augmentation technique for these features that effectively generates a variety of intermediate data, reinforcing model learning abilities, particularly for marginal cases. We assess performance on the DCASE2016 dataset, demonstrating good classification accuracies exceeding 90%, significantly outperforming the DCASE2016 baseline and proving highly competitive with state-of-the-art systems.
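The abstract does not name the augmentation technique. As a minimal sketch of how "intermediate data" can be generated from time-frequency features, the example below assumes a mixup-style scheme, in which random pairs of feature patches and their one-hot labels are blended with a coefficient drawn from a Beta distribution. The function name `mixup_batch`, the value `alpha=0.4`, the 128x128 patch size and the 15-class label layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def mixup_batch(features, labels, alpha=0.4, rng=None):
    """Blend random pairs of examples (assumed mixup-style augmentation).

    features: array of shape (batch, ...) holding time-frequency patches.
    labels:   array of shape (batch, n_classes) holding one-hot labels.
    alpha:    Beta distribution parameter (hypothetical default).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = features.shape[0]

    # One mixing coefficient per example, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha, size=batch)

    # Pair every example with a randomly chosen partner from the same batch.
    perm = rng.permutation(batch)

    # Reshape the coefficients so they broadcast over the feature dimensions.
    lam_x = lam.reshape((batch,) + (1,) * (features.ndim - 1))
    mixed_x = lam_x * features + (1.0 - lam_x) * features[perm]

    # Blend the labels with the same coefficients, giving soft targets.
    lam_y = lam.reshape(batch, 1)
    mixed_y = lam_y * labels + (1.0 - lam_y) * labels[perm]
    return mixed_x, mixed_y


# Usage with synthetic data standing in for spectrogram patches.
rng = np.random.default_rng(0)
x = rng.random((8, 128, 128))                # 8 hypothetical patches
y = np.eye(15)[rng.integers(0, 15, size=8)]  # one-hot labels, 15 classes assumed
mixed_x, mixed_y = mixup_batch(x, y, alpha=0.4, rng=rng)
```

Each blended label row still sums to one, so the outputs can be used directly as soft targets for a cross-entropy style loss. Examples that fall between two scenes are the kind of marginal case that such intermediate data is meant to cover.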

DOI: 10.21437/Interspeech.2019-1841

Cite as: Pham, L., McLoughlin, I., Phan, H., Palaniappan, R. (2019) A Robust Framework for Acoustic Scene Classification. Proc. Interspeech 2019, 3634-3638, DOI: 10.21437/Interspeech.2019-1841.

  author={Lam Pham and Ian McLoughlin and Huy Phan and Ramaswamy Palaniappan},
  title={{A Robust Framework for Acoustic Scene Classification}},
  booktitle={Proc. Interspeech 2019},