ISCA Archive Interspeech 2021

FANS: Fusing ASR and NLU for On-Device SLU

Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow

Spoken language understanding (SLU) systems translate voice commands into semantics encoded as an intent and pairs of slot tags and values. Most current SLU systems deploy a cascade of two neural models: the first maps the input audio to a transcript (ASR), and the second predicts the intent and slots from the transcript (NLU). In this paper, we introduce FANS, a new end-to-end SLU model that fuses an ASR audio encoder with a multi-task NLU decoder to infer the intent, slot tags, and slot values directly from the input audio, obviating the need for transcription. FANS consists of a shared audio encoder and three decoders, two of which are seq-to-seq decoders that predict non-null slot tags and slot values in parallel and in an auto-regressive manner. FANS' encoder and decoder architectures are flexible, which allows us to leverage different combinations of LSTMs, self-attention, and attenders. Our experiments show that, compared to state-of-the-art end-to-end SLU models, FANS reduces ICER and IRER by 30% and 7% relative, respectively, when tested on an in-house SLU dataset, and by 0.86% and 2% absolute when tested on a public SLU dataset.
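As a rough structural illustration (not the authors' implementation — the class names, toy arithmetic, and vocabularies below are invented for this sketch), a FANS-style model shares one audio encoder across an intent decoder and two autoregressive decoders that emit slot tags and slot values in parallel:

```python
# Hypothetical structural sketch of a FANS-style end-to-end SLU model.
# The real model uses LSTM/self-attention encoders and attention-based
# seq-to-seq decoders; here toy arithmetic stands in for the networks.

class SharedEncoder:
    """One encoder feeds all three decoders (no transcript is produced)."""
    def __call__(self, audio_frames):
        # Toy "encoding": summarize each frame by its mean feature value.
        return [sum(f) / len(f) for f in audio_frames]

class IntentDecoder:
    def __init__(self, intents):
        self.intents = intents
    def __call__(self, enc):
        # Toy classification: bucket the pooled encoding into an intent.
        pooled = sum(enc) / len(enc)
        return self.intents[int(pooled * 10) % len(self.intents)]

class AutoRegressiveDecoder:
    """Stands in for a seq-to-seq decoder emitting one token per step."""
    def __init__(self, vocab):
        self.vocab = vocab
    def __call__(self, enc, max_len=3):
        out, state = [], sum(enc)
        for _ in range(max_len):
            out.append(self.vocab[int(state) % len(self.vocab)])
            state += 1.0  # toy recurrence in place of an LSTM state update
        return out

class FansLikeModel:
    """Shared encoder plus three decoders, run end-to-end on audio."""
    def __init__(self):
        self.encoder = SharedEncoder()
        self.intent_decoder = IntentDecoder(["PlayMusic", "SetTimer"])
        self.tag_decoder = AutoRegressiveDecoder(["ArtistName", "SongName"])
        self.value_decoder = AutoRegressiveDecoder(["play", "jazz", "radio"])
    def __call__(self, audio_frames):
        enc = self.encoder(audio_frames)
        # The two seq-to-seq decoders run in parallel over the same encoding.
        return (self.intent_decoder(enc),
                self.tag_decoder(enc),
                self.value_decoder(enc))
```

Calling `FansLikeModel()` on a list of feature frames returns an intent together with the tag and value sequences, mirroring the paper's single-pass, transcript-free interface.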


doi: 10.21437/Interspeech.2021-793

Cite as: Radfar, M., Mouchtaris, A., Kunzmann, S., Rastrow, A. (2021) FANS: Fusing ASR and NLU for On-Device SLU. Proc. Interspeech 2021, 1224-1228, doi: 10.21437/Interspeech.2021-793

@inproceedings{radfar21_interspeech,
  author={Martin Radfar and Athanasios Mouchtaris and Siegfried Kunzmann and Ariya Rastrow},
  title={{FANS: Fusing ASR and NLU for On-Device SLU}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1224--1228},
  doi={10.21437/Interspeech.2021-793}
}