Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation

Zhiying Huang, Shiliang Zhang, Ming Lei


Audio tagging aims to identify the presence or absence of audio events in an audio clip. Recently, many researchers have explored different model structures to improve audio tagging performance. The convolutional neural network (CNN) is the most popular choice among a wide variety of model structures, and it has been successfully applied to the audio event prediction task. However, the model complexity of CNNs is relatively high, making them inefficient to ship in real products. In this paper, the compact Feedforward Sequential Memory Network (cFSMN) is proposed for the audio tagging task. Experimental results show that the cFSMN-based system yields performance comparable to the CNN-based system. Meanwhile, an audio-to-audio ratio (AAR) based data augmentation method is proposed to further improve classifier performance. Finally, using raw waveforms from the balanced training set of Audio Set, a published standard database, our system achieves state-of-the-art performance with an AUC of 0.932. Moreover, the cFSMN-based model has only 1.9 million parameters, about 1/30 of the CNN-based model.
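The abstract names AAR-based augmentation but does not define it; by analogy with SNR-based noise mixing, a plausible reading is that two audio clips are mixed at a controlled energy ratio to synthesize new training examples. The sketch below illustrates that idea only; the function name `mix_at_aar` and the RMS-based dB definition are assumptions, not the paper's actual formulation.

```python
import numpy as np

def mix_at_aar(target: np.ndarray, other: np.ndarray, aar_db: float) -> np.ndarray:
    """Mix two equal-length waveforms so the target is `aar_db` decibels
    louder (by RMS energy) than the scaled interfering clip.

    Hypothetical illustration of audio-to-audio-ratio mixing; the paper's
    exact AAR definition may differ.
    """
    def rms(x: np.ndarray) -> float:
        # small epsilon guards against division by zero for silent clips
        return float(np.sqrt(np.mean(x ** 2) + 1e-12))

    # Scale `other` so that 20*log10(rms(target) / rms(scaled)) == aar_db.
    scale = rms(target) / (rms(other) * 10.0 ** (aar_db / 20.0))
    return target + scale * other
```

Sweeping `aar_db` over a range (e.g. mixing each clip with randomly chosen clips at several ratios) would multiply the effective size of a small balanced training set, which is presumably the motivation here.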


DOI: 10.21437/Interspeech.2019-1302

Cite as: Huang, Z., Zhang, S., Lei, M. (2019) Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation. Proc. Interspeech 2019, 3377-3381, DOI: 10.21437/Interspeech.2019-1302.


@inproceedings{Huang2019,
  author={Zhiying Huang and Shiliang Zhang and Ming Lei},
  title={{Audio Tagging with Compact Feedforward Sequential Memory Network and Audio-to-Audio Ratio Based Data Augmentation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3377--3381},
  doi={10.21437/Interspeech.2019-1302},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1302}
}