Acoustic Scene Classification with Mismatched Devices Using CliqueNets and Mixup Data Augmentation

Truc Nguyen, Franz Pernkopf


Deep learning (DL) is key to the recent boost in acoustic scene classification (ASC) performance. In particular, convolutional neural networks (CNNs) are widely adopted with proven success. However, such models are large and cumbersome, i.e. they have many layers, parallel branches, or large ensembles of individual models. In this paper, we propose a resource-efficient model that uses CliqueNets for feature learning together with a mixture-of-experts (MoE) layer. CliqueNets are a recurrent feedback structure that refines features by alternating propagation between the layers of a constructed loop. In addition, we use mixup data augmentation to construct virtual training examples as convex combinations of existing ones. Mixup balances the DCASE 2018 Task 1B dataset across the recordings of the mismatched devices A, B, and C, which prevents over-fitting to Device A caused by the imbalance in the amount of data recorded per device. Experimental results show that the proposed model achieves 64.7% average classification accuracy for Devices B and C, and 70.0% for Device A, with fewer than one million parameters.
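To make the two building blocks concrete, the sketches below are illustrative only, not the exact configuration of the paper. The first follows the CliqueNet block of Yang et al. (2018): every pair of layers is connected by its own convolution, and the layers are refined by a two-stage, alternating propagation. The layer count, channel sizes, and single refinement pass are placeholder assumptions.

import torch
import torch.nn as nn

class CliqueBlock(nn.Module):
    def __init__(self, in_ch, ch, n_layers=4):
        super().__init__()
        self.n = n_layers
        # w["j->i"]: conv applied to layer j's output when updating layer i.
        self.w = nn.ModuleDict({
            f"{j}->{i}": nn.Conv2d(ch, ch, 3, padding=1)
            for i in range(n_layers) for j in range(n_layers) if i != j
        })
        # w_in[i]: conv applied to the block input when initializing layer i.
        self.w_in = nn.ModuleList(
            [nn.Conv2d(in_ch, ch, 3, padding=1) for _ in range(n_layers)]
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x0):
        # Stage I: build each layer from the input and the earlier layers.
        h = []
        for i in range(self.n):
            s = self.w_in[i](x0)
            for j in range(i):
                s = s + self.w[f"{j}->{i}"](h[j])
            h.append(self.act(s))
        # Stage II: re-update each layer from the most recent versions of
        # all other layers; this is the recurrent feedback refinement.
        for i in range(self.n):
            s = sum(self.w[f"{j}->{i}"](h[j]) for j in range(self.n) if j != i)
            h[i] = self.act(s)
        return torch.cat(h, dim=1)

The second sketch shows mixup (Zhang et al., 2018), which forms a virtual training example as a convex combination of two (input, one-hot label) pairs. The pairing of a scarce Device B recording with a plentiful Device A recording, the input shapes, and alpha=0.2 are illustrative assumptions, not the paper's settings.

def mixup(x1, y1, x2, y2, alpha=0.2):
    # Draw the mixing weight from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    x = lam * x1 + (1 - lam) * x2   # convex combination of the inputs
    y = lam * y1 + (1 - lam) * y2   # soft label with the same weight
    return x, y

# Hypothetical usage with 40x500 log-mel patches and 10 scene classes:
xa, ya = torch.randn(1, 40, 500), torch.eye(10)[3]  # Device A example
xb, yb = torch.randn(1, 40, 500), torch.eye(10)[7]  # Device B example
x_mix, y_mix = mixup(xa, ya, xb, yb)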


DOI: 10.21437/Interspeech.2019-3002

Cite as: Nguyen, T., Pernkopf, F. (2019) Acoustic Scene Classification with Mismatched Devices Using CliqueNets and Mixup Data Augmentation. Proc. Interspeech 2019, 2330-2334, DOI: 10.21437/Interspeech.2019-3002.


@inproceedings{Nguyen2019,
  author={Truc Nguyen and Franz Pernkopf},
  title={{Acoustic Scene Classification with Mismatched Devices Using CliqueNets and Mixup Data Augmentation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2330--2334},
  doi={10.21437/Interspeech.2019-3002},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3002}
}