Sound event detection is the task of detecting the type, onset time,
and offset time of sound events in audio streams. The mainstream solution
is recurrent neural networks (RNNs), which usually predict the probability
of each sound event at every time step. Connectionist temporal classification
(CTC) has been applied in order to relax the need for exact annotations
of onset and offset times; the CTC output layer is expected to generate
a peak for each event boundary where the acoustic signal is most salient.
However, with limited training data, the CTC network has been found
to train slowly and to generalize poorly to new data.
In this paper, we
try to introduce knowledge learned from a much larger corpus into the
CTC network. We train two variants of SoundNet, a deep convolutional
network that takes the audio tracks of videos as input, and tries to
approximate the visual information extracted by an image recognition
network. A lower part of SoundNet or its variants is then used as a
feature extractor for the CTC network to perform sound event detection.
We show that the new feature extractor greatly accelerates the convergence
of the CTC network and slightly improves its generalization.
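As a rough illustration of this setup, the sketch below (PyTorch) freezes the lower convolutional layers of a SoundNet-like network and uses them as a fixed feature extractor feeding a recurrent network with a CTC output layer. The layer sizes, number of event classes, and training details are illustrative assumptions, not the configuration reported in the paper.

    # Minimal sketch of transfer learning from a SoundNet-like network to a CTC
    # sound event detector. All hyperparameters below are illustrative assumptions.
    import torch
    import torch.nn as nn

    class SoundNetLowerLayers(nn.Module):
        """Stand-in for the lower convolutional layers of a SoundNet-like network,
        assumed to be pretrained on the audio tracks of videos and then frozen."""
        def __init__(self, out_channels=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=64, stride=2, padding=32), nn.ReLU(),
                nn.MaxPool1d(8),
                nn.Conv1d(16, 64, kernel_size=32, stride=2, padding=16), nn.ReLU(),
                nn.MaxPool1d(8),
                nn.Conv1d(64, out_channels, kernel_size=16, stride=2, padding=8), nn.ReLU(),
            )

        def forward(self, waveform):           # (batch, 1, samples)
            return self.conv(waveform)         # (batch, channels, frames)

    class CTCEventDetector(nn.Module):
        """Recurrent network with a CTC output layer on top of the frozen features."""
        def __init__(self, feat_dim=256, num_events=17):
            super().__init__()
            self.extractor = SoundNetLowerLayers(feat_dim)
            for p in self.extractor.parameters():   # transfer: keep pretrained weights fixed
                p.requires_grad = False
            self.rnn = nn.GRU(feat_dim, 128, bidirectional=True, batch_first=True)
            self.out = nn.Linear(256, num_events + 1)   # +1 for the CTC blank label

        def forward(self, waveform):
            feats = self.extractor(waveform).transpose(1, 2)   # (batch, frames, feat_dim)
            h, _ = self.rnn(feats)
            return self.out(h).log_softmax(dim=-1)             # per-frame log-probabilities

    # One CTC training step on dummy data (two one-second clips at 16 kHz).
    model = CTCEventDetector()
    ctc_loss = nn.CTCLoss(blank=0)
    wave = torch.randn(2, 1, 16000)
    log_probs = model(wave)                              # (batch, frames, classes)
    targets = torch.tensor([[3, 5], [1, 2]])             # dummy event label sequences
    input_lens = torch.full((2,), log_probs.size(1), dtype=torch.long)
    target_lens = torch.tensor([2, 2])
    loss = ctc_loss(log_probs.transpose(0, 1), targets, input_lens, target_lens)
    loss.backward()                                      # only the RNN and output layer update

Because the extractor is frozen, only the recurrent layers and the CTC output layer are trained on the small sound event corpus, which is the mechanism by which the pretrained knowledge is expected to speed up convergence.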
Cite as: Wang, Y., Metze, F. (2017) A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification. Proc. Interspeech 2017, 3097-3101, doi: 10.21437/Interspeech.2017-1469
@inproceedings{wang17k_interspeech,
  author    = {Yun Wang and Florian Metze},
  title     = {{A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification}},
  year      = {2017},
  booktitle = {Proc. Interspeech 2017},
  pages     = {3097--3101},
  doi       = {10.21437/Interspeech.2017-1469}
}