ISCA Archive Interspeech 2021

PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Sound event localization aims at estimating the positions of sound sources in the environment relative to an acoustic receiver (e.g., a microphone array). Recent advances in this domain have most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework, in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, yielding an additional notion of uncertainty, which many previously proposed deep learning-based systems designed for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
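The two ingredients named in the abstract, self-attention over a sequence of audio frame features and a Gaussian representation of the estimated source position, can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the helper names (`self_attention`, `gaussian_nll`) and the simplifications (queries = keys = values, diagonal covariance parameterized by log-variances) are hypothetical choices made here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over a sequence of frame
    feature vectors. For brevity, queries = keys = values = frames
    (i.e., no learned projection matrices); each output frame is a
    softmax-weighted mixture of all input frames, so temporal
    dependencies are captured without recurrence."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        w = softmax(scores)
        out.append([sum(wj * frames[j][i] for j, wj in enumerate(w))
                    for i in range(d)])
    return out

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of a target position y under a diagonal
    multivariate Gaussian with mean mu and per-dimension log-variances.
    Training on this loss makes the network output an uncertainty
    estimate (the variance) alongside the position estimate (the mean)."""
    return 0.5 * sum(lv + (yi - mi) ** 2 / math.exp(lv)
                     + math.log(2.0 * math.pi)
                     for yi, mi, lv in zip(y, mu, log_var))
```

In a full model, the attention output would pass through learned layers that regress `mu` and `log_var` per sound event; a perfectly confident, exact prediction (`mu == y`, `log_var == 0`) reduces the loss to the Gaussian normalization constant.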

doi: 10.21437/Interspeech.2021-124

Cite as: Schymura, C., Bönninghoff, B., Ochiai, T., Delcroix, M., Kinoshita, K., Nakatani, T., Araki, S., Kolossa, D. (2021) PILOT: Introducing Transformers for Probabilistic Sound Event Localization. Proc. Interspeech 2021, 2117-2121, doi: 10.21437/Interspeech.2021-124

@inproceedings{schymura21_interspeech,
  author={Christopher Schymura and Benedikt Bönninghoff and Tsubasa Ochiai and Marc Delcroix and Keisuke Kinoshita and Tomohiro Nakatani and Shoko Araki and Dorothea Kolossa},
  title={{PILOT: Introducing Transformers for Probabilistic Sound Event Localization}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2117--2121},
  doi={10.21437/Interspeech.2021-124}
}