ISCA Archive Interspeech 2021

Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it is spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.
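To illustrate the context-growth argument from the abstract, the following is a minimal NumPy sketch (not code from the paper; `T`, `L`, and `D` are illustrative values). It composes per-layer attention masks to show that stacking D restricted self-attention layers with L frames of look-ahead yields an overall look-ahead of D*L frames, whereas a DCN-style scheme, which propagates only a causal stream between layers and reads the look-ahead frames in a non-causal branch, keeps the overall look-ahead at L regardless of depth.

```python
import numpy as np

T, L, D = 10, 2, 3  # frames, per-layer look-ahead, number of layers (illustrative)

def lookahead(mask):
    """Largest future offset j - i that frame i can see through the mask."""
    return max(j - i for i in range(T) for j in range(T) if mask[i, j])

def compose(a, b):
    """Reachability obtained by stacking two attention masks."""
    return (a.astype(int) @ b.astype(int)) > 0

# Restricted self-attention: each layer sees L future frames, and the
# look-ahead accumulates with depth (D layers -> D * L frames total).
restricted = np.tril(np.ones((T, T), dtype=bool), k=L)
reach = np.eye(T, dtype=bool)
for _ in range(D):
    reach = compose(reach, restricted)
print(lookahead(reach))      # D * L = 6

# DCN-style sketch: only the causal stream (no future frames) is passed
# between layers; the non-causal branch reads the L look-ahead frames
# once, so the overall look-ahead stays at L for any depth.
causal = np.tril(np.ones((T, T), dtype=bool))
reach_dcn = np.eye(T, dtype=bool)
for _ in range(D):
    reach_dcn = compose(reach_dcn, causal)
reach_dcn = compose(reach_dcn, restricted)  # single non-causal read
print(lookahead(reach_dcn))  # L = 2
```

This mask-composition view is only a receptive-field argument; the actual DCN layer computes both a causal and a non-causal output in parallel, but feeds only the causal stream forward, which is what bounds the total algorithmic latency.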

doi: 10.21437/Interspeech.2021-1693

Cite as: Moritz, N., Hori, T., Le Roux, J. (2021) Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition. Proc. Interspeech 2021, 1822-1826, doi: 10.21437/Interspeech.2021-1693

@inproceedings{moritz21_interspeech,
  author={Niko Moritz and Takaaki Hori and Jonathan Le Roux},
  title={{Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition}},
  booktitle={Proc. Interspeech 2021},
  year={2021},
  pages={1822--1826},
  doi={10.21437/Interspeech.2021-1693}
}