A Deep Residual Network for Large-Scale Acoustic Scene Analysis

Logan Ford, Hao Tang, François Grondin, James Glass


Many of the recent advances in audio event detection, particularly on the AudioSet data set, have focused on improving performance using the released embeddings produced by a pre-trained model. In this work, we instead study the task of training a multi-label event classifier directly from the audio recordings of AudioSet. Using the audio recordings, we not only reproduce results from prior work but also confirm the improvements of other proposed additions, such as an attention module. Moreover, by training the embedding network jointly with these additions, we achieve an mAP of 0.392 and an AUC of 0.971, surpassing the state of the art without transfer learning from a large data set. We also analyze the output activations of the network and find that the models are able to localize audio events when a finer time resolution is needed.
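The two reported metrics, mAP and AUC, are computed per event class and then averaged over classes, since each AudioSet clip can carry multiple labels. The sketch below (an illustration with toy scores and labels, not the authors' evaluation code) shows one common way to compute both from per-class prediction scores:

```python
# Illustrative sketch: class-averaged mAP and AUC for a multi-label
# classifier. Scores and labels here are toy data, not AudioSet results.

def average_precision(scores, labels):
    """AP for one class: mean of precision@k at each true-positive hit."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def roc_auc(scores, labels):
    """AUC for one class via the rank-sum (Mann-Whitney U) formulation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    pos_rank_sum = sum(r for r, i in enumerate(order, start=1) if labels[i])
    p = sum(labels)
    n = len(labels) - p
    return (pos_rank_sum - p * (p + 1) / 2) / (p * n)

# Two event classes, four clips each: per-class scores and binary labels.
scores = [[0.9, 0.2, 0.7, 0.1], [0.3, 0.8, 0.4, 0.6]]
labels = [[1, 0, 0, 1], [0, 1, 0, 1]]

mAP = sum(average_precision(s, l) for s, l in zip(scores, labels)) / len(scores)
mAUC = sum(roc_auc(s, l) for s, l in zip(scores, labels)) / len(scores)
print(mAP, mAUC)  # → 0.875 0.75
```

In practice evaluation code typically also handles ties in scores and classes with no positive examples; this sketch omits both for brevity.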


DOI: 10.21437/Interspeech.2019-2731

Cite as: Ford, L., Tang, H., Grondin, F., Glass, J. (2019) A Deep Residual Network for Large-Scale Acoustic Scene Analysis. Proc. Interspeech 2019, 2568-2572, DOI: 10.21437/Interspeech.2019-2731.


@inproceedings{Ford2019,
  author={Logan Ford and Hao Tang and Fran\c{c}ois Grondin and James Glass},
  title={{A Deep Residual Network for Large-Scale Acoustic Scene Analysis}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2568--2572},
  doi={10.21437/Interspeech.2019-2731},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2731}
}