Temporal Transformer Networks for Acoustic Scene Classification

Teng Zhang, Kailai Zhang, Ji Wu


Neural networks have proven to be powerful models for acoustic scene classification, but they remain limited by their lack of temporal invariance to the audio data. In this paper, a novel temporal transformer module is proposed to allow the temporal manipulation of data in neural networks. The module consists of a Fourier transform layer applied to feature maps and a learnable feature reduction layer, and it can be inserted into existing convolutional neural network (CNN) and long short-term memory (LSTM) models. Experiments on the LITIS Rouen and DCASE2016 datasets show that the proposed method yields a significant improvement over the existing neural networks. Our approach performs significantly better than the state-of-the-art result on the LITIS Rouen dataset, achieving a relative reduction of 23.6% in classification error.
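
The abstract names the two parts of the module but not their details, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that the Fourier transform layer keeps the magnitude spectrum along the time axis and that the feature reduction layer is a single linear projection; the function name `temporal_transformer`, the shapes, and the random weights are hypothetical. The sketch relies on one well-known property: the magnitude of a discrete Fourier transform does not change when the input is circularly shifted in time.

```python
import numpy as np


def temporal_transformer(feature_map, weights):
    """Hypothetical sketch: FFT magnitude over time, then a linear reduction.

    feature_map: array of shape (channels, time)
    weights:     array of shape (time // 2 + 1, reduced); stands in for the
                 learnable feature reduction layer (assumed linear here)
    """
    # Magnitude spectrum along the time axis. Discarding the phase makes the
    # result invariant to circular shifts of the input in time.
    spectrum = np.abs(np.fft.rfft(feature_map, axis=-1))
    # Project the frequency bins to a smaller feature vector per channel.
    return spectrum @ weights


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))   # 4 channels, 64 time frames (made-up sizes)
w = rng.standard_normal((33, 8))   # 64 // 2 + 1 = 33 bins reduced to 8 features

y = temporal_transformer(x, w)
# The same input, circularly shifted by 10 frames along the time axis.
y_shifted = temporal_transformer(np.roll(x, 10, axis=-1), w)
```

Under these assumptions, `y` and `y_shifted` agree up to floating-point error, which illustrates the kind of temporal invariance the abstract refers to. In a trained network the projection weights would be learned jointly with the surrounding CNN or LSTM layers rather than drawn at random.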


DOI: 10.21437/Interspeech.2018-1152

Cite as: Zhang, T., Zhang, K., Wu, J. (2018) Temporal Transformer Networks for Acoustic Scene Classification. Proc. Interspeech 2018, 1349-1353, DOI: 10.21437/Interspeech.2018-1152.


@inproceedings{Zhang2018,
  author={Teng Zhang and Kailai Zhang and Ji Wu},
  title={Temporal Transformer Networks for Acoustic Scene Classification},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1349--1353},
  doi={10.21437/Interspeech.2018-1152},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1152}
}