Time Aggregation Operators for Multi-label Audio Event Detection

Pankaj Joshi, Digvijaysingh Gautam, Ganesh Ramakrishnan, Preethi Jyothi


In this paper, we present a state-of-the-art system for audio event detection. The labels on the training (and evaluation) data specify the set of events occurring in each audio clip, but neither their time spans nor the order in which they occur. Specifically, we address the weakly supervised learning task of the “Detection and Classification of Acoustic Scenes and Events (DCASE) 2017” challenge. We use the winning entry in this challenge, by Xu et al., as our starting point and identify several important modifications that allow us to improve on their results significantly. Our techniques pertain to the aggregation and consolidation of time and frequency signals over a (temporal) sequence before decoding the labels. More generally, our work is also relevant to other tasks that involve learning from weak labels on sequential data.
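To illustrate what aggregating over time before decoding labels can look like in a weakly labeled setting, here is a minimal sketch. It assumes a model has already produced per-frame, per-class probabilities for one clip. The function names, the three pooling modes (max, mean, and a softmax-weighted average), and the 0.5 threshold are illustrative assumptions, not the operators or settings proposed in the paper.

```python
import math


def aggregate_over_time(frame_probs, mode="max"):
    """Collapse a T x C list of frame-level probabilities into C clip-level scores.

    Hypothetical helper for illustration; the pooling modes are generic
    choices, not the paper's operators.
    """
    if not frame_probs:
        raise ValueError("need at least one frame")
    # Transpose so that each entry holds the T frame scores of one class.
    columns = list(zip(*frame_probs))
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "softmax":
        # Weight each frame by exp(p), so confident frames count more.
        scores = []
        for col in columns:
            weights = [math.exp(p) for p in col]
            total = sum(weights)
            scores.append(sum(w * p for w, p in zip(weights, col)) / total)
        return scores
    raise ValueError(f"unknown mode: {mode}")


def decode_labels(clip_scores, class_names, threshold=0.5):
    """Return the set of classes whose clip-level score reaches the threshold."""
    return {name for name, s in zip(class_names, clip_scores) if s >= threshold}


# Three frames, two hypothetical classes.
frame_probs = [[0.1, 0.9], [0.2, 0.1], [0.3, 0.2]]
class_names = ["speech", "car"]
clip_scores = aggregate_over_time(frame_probs, mode="max")
predicted = decode_labels(clip_scores, class_names)
```

Max pooling flags an event if any single frame is confident, mean pooling favors events that persist through the clip, and the softmax-weighted average sits between the two. Only clip-level labels are needed to train or evaluate against scores aggregated this way, which is what makes such operators a natural fit for weak supervision.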


DOI: 10.21437/Interspeech.2018-1637

Cite as: Joshi, P., Gautam, D., Ramakrishnan, G., Jyothi, P. (2018) Time Aggregation Operators for Multi-label Audio Event Detection. Proc. Interspeech 2018, 3309-3313, DOI: 10.21437/Interspeech.2018-1637.


@inproceedings{Joshi2018,
  author={Pankaj Joshi and Digvijaysingh Gautam and Ganesh Ramakrishnan and Preethi Jyothi},
  title={Time Aggregation Operators for Multi-label Audio Event Detection},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3309--3313},
  doi={10.21437/Interspeech.2018-1637},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1637}
}