Deep Convolutional Neural Network with Scalogram for Audio Scene Modeling

Hangting Chen, Pengyuan Zhang, Haichuan Bai, Qingsheng Yuan, Xiuguo Bao, Yonghong Yan


Deep learning has recently improved the performance of acoustic scene classification. However, learning is usually based on the short-time Fourier transform and hand-tailored filters; learning directly from raw signals remains a major challenge. In this paper, we propose an approach to learning audio scene patterns from the scalogram, which is extracted from the raw signal with simple wavelet transforms. The experiments were conducted on the DCASE2016 dataset. We compared the scalogram with classical Mel energy, which showed that the multi-scale feature led to a clear accuracy increase. The convolutional neural network integrated with the maximum-average downsampled scalogram achieved an accuracy of 90.5% on the DCASE2016 evaluation set.
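As a rough illustration of the front end the abstract describes, the sketch below computes a scalogram by convolving a raw signal with Morlet wavelets at several scales, then reduces the time resolution with a combined maximum/average pooling step. This is not the authors' implementation: the wavelet choice, scale grid, pooling window, and the interpretation of "maximum-average downsampling" as averaging a max-pooled and a mean-pooled frame are all illustrative assumptions.

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    """Real-valued Morlet wavelet sampled at times t, dilated by `scale`."""
    x = t / scale
    return np.exp(-0.5 * x**2) * np.cos(w0 * x) / np.sqrt(scale)

def scalogram(signal, scales, wavelet_len=256):
    """Magnitude scalogram: one row per scale, one column per sample."""
    t = np.arange(wavelet_len) - wavelet_len // 2
    rows = [np.abs(np.convolve(signal, morlet(t, s), mode="same"))
            for s in scales]
    return np.stack(rows)  # shape: (n_scales, n_samples)

def max_avg_pool(scalo, pool=64):
    """Hypothetical max-average downsampling along time:
    average of max-pooled and mean-pooled values per frame."""
    n_scales, n = scalo.shape
    n_frames = n // pool
    x = scalo[:, :n_frames * pool].reshape(n_scales, n_frames, pool)
    return 0.5 * (x.max(axis=2) + x.mean(axis=2))
```

For a 1-second signal at 16 kHz with 40 log-spaced scales and a pooling window of 64 samples, this yields a 40 x 250 feature map that a CNN can consume like a spectrogram.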


DOI: 10.21437/Interspeech.2018-1524

Cite as: Chen, H., Zhang, P., Bai, H., Yuan, Q., Bao, X., Yan, Y. (2018) Deep Convolutional Neural Network with Scalogram for Audio Scene Modeling. Proc. Interspeech 2018, 3304-3308, DOI: 10.21437/Interspeech.2018-1524.


@inproceedings{Chen2018,
  author={Hangting Chen and Pengyuan Zhang and Haichuan Bai and Qingsheng Yuan and Xiuguo Bao and Yonghong Yan},
  title={Deep Convolutional Neural Network with Scalogram for Audio Scene Modeling},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3304--3308},
  doi={10.21437/Interspeech.2018-1524},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1524}
}