ISCA Archive Interspeech 2022

Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers

Chiori Hori, Takaaki Hori, Jonathan Le Roux

To apply scene-aware interaction technology to real-time dialog systems, we propose an online low-latency response generation framework for scene-aware interaction using a video question answering setup. This paper extends our prior work on low-latency video captioning to build a novel approach that can optimize the timing to generate each answer under a trade-off between the latency of generation and the quality of the answer. For video QA, the timing detector is now in charge of finding the timing of the question-relevant event, instead of determining when the system has seen enough to generate a general caption as in the video captioning case. Our audio-visual scene-aware dialog system built for the 10th Dialog System Technology Challenge was extended to incorporate this low-latency function. Experiments with the MSRVTT-QA and AVSD datasets show that our approach achieves between 97% and 99% of the answer quality of the upper bound given by a pre-trained Transformer using the entire video clips, while using less than 40% of the frames from the beginning of each clip.
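
The abstract describes an online loop in which a timing detector watches the incoming audio-visual stream and triggers answer generation once it judges that the question-relevant evidence has been seen. The following is a minimal sketch of that idea only, not the authors' implementation: the feature format, the `timing_score` detector, the `generate_answer` model, and the threshold are all hypothetical placeholders standing in for the paper's trained Transformer components.

```python
# Minimal sketch (assumptions, not the paper's code) of online low-latency VideoQA:
# consume frames one by one, score "readiness to answer" with a hypothetical
# timing detector, and generate the answer as soon as the score crosses a
# threshold, trading latency against answer quality.

from typing import Callable, Iterable, List, Optional

Frame = list  # placeholder for a per-frame audio-visual feature vector


def answer_online(
    frames: Iterable[Frame],
    question: str,
    timing_score: Callable[[List[Frame], str], float],   # hypothetical detector: confidence it is time to answer
    generate_answer: Callable[[List[Frame], str], str],  # hypothetical answer generator (e.g. a Transformer decoder)
    threshold: float = 0.5,                               # higher threshold -> later answers, typically higher quality
) -> Optional[str]:
    """Return an answer as soon as the timing detector fires on the stream."""
    seen: List[Frame] = []
    for frame in frames:
        seen.append(frame)
        # The detector only sees frames observed so far plus the question,
        # so the decision can be made online with low latency.
        if timing_score(seen, question) >= threshold:
            return generate_answer(seen, question)
    # Fallback: the whole clip was observed without the detector firing.
    return generate_answer(seen, question) if seen else None
```

Raising `threshold` moves the operating point toward the offline upper bound (answering only after most of the clip), while lowering it favors earlier answers at some cost in quality, which is the trade-off the paper evaluates.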


doi: 10.21437/Interspeech.2022-10891

Cite as: Hori, C., Hori, T., Le Roux, J. (2022) Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers. Proc. Interspeech 2022, 4511-4515, doi: 10.21437/Interspeech.2022-10891

@inproceedings{hori22_interspeech,
  author={Chiori Hori and Takaaki Hori and Jonathan {Le Roux}},
  title={{Low-Latency Online Streaming VideoQA Using Audio-Visual Transformers}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={4511--4515},
  doi={10.21437/Interspeech.2022-10891}
}