A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification

Yun Wang, Florian Metze


Sound event detection is the task of detecting the type, onset time, and offset time of sound events in audio streams. The mainstream solution uses recurrent neural networks (RNNs), which typically predict the probability of each sound event at every time step. Connectionist temporal classification (CTC) has been applied to relax the need for exact annotations of onset and offset times; the CTC output layer is expected to generate a peak for each event boundary where the acoustic signal is most salient. However, with limited training data, the CTC network has been found to train slowly and to generalize poorly to new data.
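
The following is a minimal sketch, not the authors' code, of the CTC setup the abstract describes: an RNN emits per-frame log-probabilities over event-boundary labels plus a CTC blank, so training needs only the sequence of boundary labels, not exact onset and offset times. PyTorch, the class name, the label scheme (one onset and one offset label per event type), and all sizes are assumptions for illustration.

import torch
import torch.nn as nn

NUM_EVENTS = 17                   # hypothetical number of sound event classes
NUM_LABELS = 2 * NUM_EVENTS + 1   # onset + offset label per event, plus CTC blank

class CTCEventDetector(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, NUM_LABELS)

    def forward(self, feats):                  # feats: (batch, time, feat_dim)
        out, _ = self.rnn(feats)
        return self.proj(out).log_softmax(-1)  # per-frame label log-probabilities

model = CTCEventDetector()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, 500, 40)               # 4 clips, 500 frames each (dummy data)
log_probs = model(feats).transpose(0, 1)      # CTCLoss expects (time, batch, labels)
targets = torch.randint(1, NUM_LABELS, (4, 10))   # boundary label sequences, no blanks
input_lens = torch.full((4,), 500, dtype=torch.long)
target_lens = torch.full((4,), 10, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()

Because CTC only has to explain the label sequence, the network is free to place each boundary label's probability peak where the signal is most salient, which is the behavior the abstract refers to.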

In this paper, we introduce knowledge learned from a much larger corpus into the CTC network. We train two variants of SoundNet, a deep convolutional network that takes the audio tracks of videos as input and tries to approximate the visual information extracted by an image recognition network. The lower layers of SoundNet or its variants are then used as a feature extractor for the CTC network to perform sound event detection. We show that the new feature extractor greatly accelerates the convergence of the CTC network and slightly improves its generalization.
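
Continuing the sketch above, here is one hedged illustration of the transfer-learning idea: the lower convolutional layers of a SoundNet-style network (pretrained elsewhere) are frozen and reused as a feature extractor in front of the CTC network. The layer sizes, the checkpoint filename, and the decision to freeze the extractor are illustrative assumptions, not the actual SoundNet architecture or the paper's exact recipe.

class SoundNetLowerLayers(nn.Module):
    def __init__(self):
        super().__init__()
        # SoundNet-style 1-D convolutions over the raw waveform (sizes hypothetical)
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32), nn.ReLU(),
            nn.MaxPool1d(8),
            nn.Conv1d(16, 32, kernel_size=32, stride=2, padding=16), nn.ReLU(),
            nn.MaxPool1d(8),
        )

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.conv(wav)          # (batch, 32, frames)

extractor = SoundNetLowerLayers()
# extractor.load_state_dict(torch.load("soundnet_lower.pt"))  # hypothetical pretrained weights
for p in extractor.parameters():       # freeze: only the CTC network is trained
    p.requires_grad = False

wav = torch.randn(4, 1, 160000)        # 10 s of 16 kHz audio per clip (dummy data)
feats = extractor(wav).transpose(1, 2) # (batch, frames, 32) for the RNN
detector = CTCEventDetector(feat_dim=32)
log_probs = detector(feats)

Reusing pretrained lower layers gives the CTC network a feature space already shaped by a large corpus, which is consistent with the abstract's finding of faster convergence and slightly better generalization.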


DOI: 10.21437/Interspeech.2017-1469

Cite as: Wang, Y., Metze, F. (2017) A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification. Proc. Interspeech 2017, 3097-3101, DOI: 10.21437/Interspeech.2017-1469.


@inproceedings{Wang2017,
  author={Yun Wang and Florian Metze},
  title={A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3097--3101},
  doi={10.21437/Interspeech.2017-1469},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1469}
}