Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering

Hongyin Luo, Mitra Mohtarami, James Glass, Karthik Krishnamurthy, Brigitte Richardson


Traditional video question answering models have been designed to retrieve whole videos in answer to input questions. A drawback of this scenario is that users must watch the entire video to find the desired answer. Recent work has presented unsupervised neural models with attention mechanisms that find moments or segments within retrieved videos to answer input questions more precisely. Although the two tasks look similar, the latter is more challenging: video retrieval only needs to judge whether a question is answered somewhere in a video and return the entire video, whereas moment detection must judge which moment within a video matches the question and accurately return that segment. Moreover, there is a lack of labeled data for training moment detection models. In this paper, we focus on integrating video retrieval and moment detection in a unified corpus. We further develop two models for the tasks: a self-attention convolutional network and a memory network. Experimental results on our corpus show that the neural models can accurately detect and retrieve moments in supervised settings.
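The abstract mentions attention-based matching between a question and candidate video moments. As a rough illustration only (not the paper's actual architecture, and using hypothetical toy embeddings), scaled dot-product attention can score each segment of a video against a question vector, so the highest-weighted segment is returned as the answer moment:

```python
import numpy as np

def moment_attention_scores(question_vec, segment_vecs):
    """Score each video segment against a question embedding via
    scaled dot-product attention; returns a distribution over segments.
    Illustrative sketch, not the model described in the paper."""
    d = question_vec.shape[-1]
    logits = segment_vecs @ question_vec / np.sqrt(d)
    # softmax over segments (numerically stabilized)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Toy example: three segments with hand-made 3-dim features.
segments = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# A question embedding closest to the second segment.
question = np.array([0.1, 0.9, 0.1])

weights = moment_attention_scores(question, segments)
best_segment = int(np.argmax(weights))  # index of the predicted moment
```

Here `best_segment` picks out the segment whose features align best with the question, which is the core decision moment detection must make beyond whole-video retrieval.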


DOI: 10.21437/Interspeech.2019-1736

Cite as: Luo, H., Mohtarami, M., Glass, J., Krishnamurthy, K., Richardson, B. (2019) Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering. Proc. Interspeech 2019, 599-603, DOI: 10.21437/Interspeech.2019-1736.


@inproceedings{Luo2019,
  author={Hongyin Luo and Mitra Mohtarami and James Glass and Karthik Krishnamurthy and Brigitte Richardson},
  title={{Integrating Video Retrieval and Moment Detection in a Unified Corpus for Video Question Answering}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={599--603},
  doi={10.21437/Interspeech.2019-1736},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1736}
}