Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog

Chiori Hori, Anoop Cherian, Tim K. Marks, Takaaki Hori


Multimodal fusion of audio, vision, and text has demonstrated significant benefits in advancing the performance of several tasks, including machine translation, video captioning, and video summarization. Audio-Visual Scene-aware Dialog (AVSD) is a recently proposed and more challenging task that focuses on generating sentence responses to questions asked in a dialog about video content. While prior approaches to this task have shown the need for multimodal fusion to improve response quality, the best-performing systems often rely heavily on human-generated summaries of the video content, which are unavailable when such systems are deployed in the real world. This paper investigates how to compensate for this information, which is missing at inference time but available during training. To this end, we propose a novel AVSD system using student-teacher learning, in which a student network is jointly trained to mimic the teacher's responses. Our experiments demonstrate that, in addition to yielding state-of-the-art accuracy relative to the baseline DSTC7-AVSD system, the proposed approach (which does not use human-generated summaries at test time) performs competitively with methods that do use those summaries.
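To give a concrete sense of the student-teacher idea the abstract describes, the sketch below shows a generic distillation-style objective: the student is trained both on the ground-truth token (hard cross-entropy) and to match the output distribution of a teacher that had access to extra information. This is a minimal, illustrative sketch of standard student-teacher learning in general; the exact loss, architecture, and weighting used in the paper are not specified here, and `alpha` is a hypothetical mixing weight introduced only for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_student_teacher_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Generic joint objective: hard cross-entropy on the ground-truth token
    plus KL(teacher || student) on the soft teacher distribution.
    `alpha` (hypothetical) trades off the two terms."""
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    ce = -np.log(p_s[target] + 1e-12)  # cross-entropy w.r.t. ground truth
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)))  # soft matching term
    return alpha * ce + (1.0 - alpha) * kl
```

In this generic formulation, the teacher's logits would come from a network trained with the human-generated summaries, while the student sees only the modalities available at deployment; minimizing the KL term pushes the student to reproduce the teacher's responses without that extra input.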


DOI: 10.21437/Interspeech.2019-3143

Cite as: Hori, C., Cherian, A., Marks, T.K., Hori, T. (2019) Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog. Proc. Interspeech 2019, 1886-1890, DOI: 10.21437/Interspeech.2019-3143.


@inproceedings{Hori2019,
  author={Chiori Hori and Anoop Cherian and Tim K. Marks and Takaaki Hori},
  title={{Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={1886--1890},
  doi={10.21437/Interspeech.2019-3143},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3143}
}