Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification

Teng Zhang, Kailai Zhang, Ji Wu


Neural network architectures such as long short-term memory (LSTM) have been proven to be powerful models for processing sequences including text, audio and video. On the basis of the vanilla LSTM, multi-modal attention mechanisms are proposed in this paper to synthesize the temporal and semantic information of input sequences. First, we reconstruct the forget and input gates of the LSTM unit from the perspective of an attention model in the temporal dimension. Then the memory content of the LSTM unit is recalculated using a cluster-based attention mechanism in semantic space. Experiments on acoustic scene classification tasks show performance improvements of the proposed methods when compared with the vanilla LSTM. The classification errors on the LITIS ROUEN dataset and the DCASE2016 dataset are reduced by 16.5% and 7.7% relative, respectively. We achieved second place in Kaggle's YouTube-8M video understanding challenge, and the multi-modal attention based LSTM model is one of our best-performing single systems.
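To make the idea of "gates as temporal attention" concrete, the following is a minimal NumPy sketch of one LSTM step in which the forget gate is modulated by attention scores computed over the history of hidden states. All parameter names, sizes, and the exact modulation are illustrative assumptions for this sketch, not the authors' implementation; the cluster-based semantic attention is likewise omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 4   # hidden size (hypothetical)
T = 6   # sequence length (hypothetical)

# Random weights stand in for trained parameters in this toy sketch.
Wf, Wi, Wc, Wo = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(4))

def lstm_step_with_temporal_attention(x_t, h_prev, c_prev, h_history):
    """One LSTM step where the forget gate is scaled by a temporal
    attention context over past hidden states -- a loose illustration
    of reconstructing gates from an attention-model perspective."""
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(Wf @ z)      # forget gate
    i = sigmoid(Wi @ z)      # input gate
    g = np.tanh(Wc @ z)      # candidate memory
    o = sigmoid(Wo @ z)      # output gate
    if h_history:
        H = np.stack(h_history)          # (t, d) past hidden states
        scores = softmax(H @ h_prev)     # attention weights over time
        context = scores @ H             # attention context vector
        f = f * sigmoid(context)         # attention-modulated forget gate
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

h, c, hist = np.zeros(d), np.zeros(d), []
for t in range(T):
    x_t = rng.standard_normal(d)
    h, c = lstm_step_with_temporal_attention(x_t, h, c, hist)
    hist.append(h)
print(h.shape)  # (4,)
```

In this toy form the attention context simply rescales the standard sigmoid forget gate; the paper's actual reconstruction of the gates may differ in how the scores are parameterized and combined.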


DOI: 10.21437/Interspeech.2018-1138

Cite as: Zhang, T., Zhang, K., Wu, J. (2018) Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification. Proc. Interspeech 2018, 3328-3332, DOI: 10.21437/Interspeech.2018-1138.


@inproceedings{Zhang2018,
  author={Teng Zhang and Kailai Zhang and Ji Wu},
  title={Multi-modal Attention Mechanisms in LSTM and Its Application to Acoustic Scene Classification},
  year={2018},
  booktitle={Proc. Interspeech 2018},
  pages={3328--3332},
  doi={10.21437/Interspeech.2018-1138},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1138}
}