ISCA Archive Interspeech 2021

PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Christopher Schymura, Benedikt Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa

Sound event localization aims at estimating the positions of sound sources in the environment relative to an acoustic receiver (e.g., a microphone array). Recent advances in this domain have most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework, in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, yielding an additional notion of uncertainty, which many previously proposed deep learning-based systems designed for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
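The two ingredients named in the abstract, self-attention over a sequence of audio frame features and a Gaussian representation of the estimated source position, can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the helper names (`self_attention`, `gaussian_nll`) and the simplifications (queries = keys = values, diagonal covariance parameterized by log-variances) are hypothetical choices made here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over a sequence of frame
    feature vectors. For brevity, queries = keys = values = frames
    (i.e., no learned projection matrices); each output frame is a
    softmax-weighted mixture of all input frames, so temporal
    dependencies are captured without recurrence."""
    d = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in frames]
        w = softmax(scores)
        out.append([sum(wj * frames[j][i] for j, wj in enumerate(w))
                    for i in range(d)])
    return out

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of a target position y under a diagonal
    multivariate Gaussian with mean mu and per-dimension log-variances.
    Training on this loss makes the network output an uncertainty
    estimate (the variance) alongside the position estimate (the mean)."""
    return 0.5 * sum(lv + (yi - mi) ** 2 / math.exp(lv)
                     + math.log(2.0 * math.pi)
                     for yi, mi, lv in zip(y, mu, log_var))
```

In a full model, the attention output would pass through learned layers that regress `mu` and `log_var` per sound event; a perfectly confident, exact prediction (`mu == y`, `log_var == 0`) reduces the loss to the Gaussian normalization constant.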

doi: 10.21437/Interspeech.2021-124

Cite as: Schymura, C., Bönninghoff, B., Ochiai, T., Delcroix, M., Kinoshita, K., Nakatani, T., Araki, S., Kolossa, D. (2021) PILOT: Introducing Transformers for Probabilistic Sound Event Localization. Proc. Interspeech 2021, 2117-2121, doi: 10.21437/Interspeech.2021-124

@inproceedings{schymura21_interspeech,
  author={Christopher Schymura and Benedikt Bönninghoff and Tsubasa Ochiai and Marc Delcroix and Keisuke Kinoshita and Tomohiro Nakatani and Shoko Araki and Dorothea Kolossa},
  title={{PILOT: Introducing Transformers for Probabilistic Sound Event Localization}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={2117--2121},
  doi={10.21437/Interspeech.2021-124}
}