Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs

Javier Darna-Sequeiros, Doroteo T. Toledano


This paper focuses on the audio event detection problem, in particular on Google Audio Set, a database published in 2017 whose size and breadth are unprecedented for this problem. To explore the possibilities of this dataset, several classifiers based on different types of deep neural networks were designed, implemented and evaluated to assess the impact of factors such as the network architecture, the number of layers and the encoding of the input data on model performance. Of all the classifiers tested, the LSTM neural network showed the best results, with a mean average precision of 0.26652 and a mean recall of 0.30698. This result is particularly relevant since we use the embeddings provided by Google as input to the DNNs, which are sequences of at most 10 feature vectors and therefore limit the sequence modelling capabilities of LSTMs.
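The headline metric above, mean average precision (mAP), is the per-class average precision averaged over all classes that have at least one positive example. A minimal NumPy sketch of this standard computation (not the authors' evaluation code) could look like this:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP for one class: rank examples by descending score, then average
    the precision values at each rank where a positive example occurs."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)                      # positives seen up to each rank
    precisions = hits / np.arange(1, len(y) + 1)
    return precisions[y == 1].mean()

def mean_average_precision(Y_true, S):
    """mAP for multi-label data: Y_true and S are (examples, classes)
    arrays; classes with no positives are skipped."""
    aps = [average_precision(Y_true[:, c], S[:, c])
           for c in range(Y_true.shape[1])
           if Y_true[:, c].any()]
    return float(np.mean(aps))
```

For example, with labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the positives fall at ranks 1 and 3, so AP = (1/1 + 2/3)/2 = 5/6.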


DOI: 10.21437/IberSPEECH.2018-14

Cite as: Darna-Sequeiros, J., Toledano, D.T. (2018) Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs. Proc. IberSPEECH 2018, 64-67, DOI: 10.21437/IberSPEECH.2018-14.


@inproceedings{Darna-Sequeiros2018,
  author={Javier Darna-Sequeiros and Doroteo {T. Toledano}},
  title={{Audio event detection on Google's Audio Set database: Preliminary results using different types of DNNs}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={64--67},
  doi={10.21437/IberSPEECH.2018-14},
  url={http://dx.doi.org/10.21437/IberSPEECH.2018-14}
}