In this paper we investigate the importance of the extent of memory in sequential self-attention for sound recognition. We propose a memory-controlled sequential self-attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on the recognition performance of the self-attention-based SED model. We extend the proposed idea with a multi-head self-attention mechanism in which each attention head processes the audio embedding with an explicit attention width. The proposed use of memory-controlled sequential self-attention offers a way to induce relations among frames of sound event tokens. We show that our memory-controlled self-attention model achieves an event-based F-score of 33.92% on the URBAN-SED dataset, outperforming the 20.10% F-score of the same model without self-attention.
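To make the mechanism concrete, below is a minimal sketch (not the authors' code) of memory-controlled self-attention: standard scaled dot-product self-attention over CRNN output frames, masked to a fixed temporal window around each frame. The function name `windowed_self_attention`, the parameter `attn_width`, and the per-head width list are illustrative assumptions; the paper's exact hyperparameters and implementation may differ.

```python
# Hedged sketch of memory-controlled (windowed) self-attention in PyTorch.
# All names and sizes here are illustrative, not from the paper.
import torch
import torch.nn.functional as F


def windowed_self_attention(x, w_q, w_k, w_v, attn_width):
    """Self-attention over frames, restricted to +/- attn_width neighbours.

    x: (batch, time, dim) audio embedding, e.g. CRNN output frames.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, T, T)

    # Band mask: frame t may only attend to frames within the memory window.
    t = x.shape[1]
    idx = torch.arange(t)
    blocked = (idx[None, :] - idx[:, None]).abs() > attn_width  # True = masked out
    scores = scores.masked_fill(blocked, float("-inf"))

    return F.softmax(scores, dim=-1) @ v


# Multi-head variant: each head uses its own explicit attention width,
# and head outputs are concatenated along the feature dimension.
batch, t, dim = 2, 100, 64
x = torch.randn(batch, t, dim)
widths = [5, 10, 20, 40]  # hypothetical per-head memory extents
heads = []
for w in widths:
    w_q, w_k, w_v = (torch.randn(dim, dim // len(widths)) for _ in range(3))
    heads.append(windowed_self_attention(x, w_q, w_k, w_v, attn_width=w))
out = torch.cat(heads, dim=-1)  # (batch, T, dim)
print(out.shape)
```

In this reading, the attention width directly controls the extent of memory: a small width limits each frame to local context, while larger widths let heads relate temporally distant sound event tokens.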
Cite as: Pankajakshan, A., Bear, H.L., Subramanian, V., Benetos, E. (2020) Memory Controlled Sequential Self Attention for Sound Recognition. Proc. Interspeech 2020, 831-835, doi: 10.21437/Interspeech.2020-1953
@inproceedings{pankajakshan20_interspeech,
  author={Arjun Pankajakshan and Helen L. Bear and Vinod Subramanian and Emmanouil Benetos},
  title={{Memory Controlled Sequential Self Attention for Sound Recognition}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={831--835},
  doi={10.21437/Interspeech.2020-1953}
}