Generating Natural Video Descriptions via Multimodal Processing

Qin Jin, Junwei Liang, Xiaozhu Lin


Generating natural language descriptions of visual content is an intriguing task with wide applications, such as assisting visually impaired people. Recent advances in image captioning have stimulated further study of this task, including generating natural descriptions for videos. Most work on video description generation focuses on visual information in the video; however, audio also provides rich information for describing video content. In this paper, we propose to generate video descriptions in natural sentences via multimodal processing, i.e., exploiting both audio and visual cues through unified deep neural networks with convolutional and recurrent structure. Experimental results on the Microsoft Research Video Description (MSVD) corpus show that fusing audio information significantly improves video description performance. We also investigate the impact of the number of images versus the number of captions on image captioning performance, and observe that when limited training data are available, the number of distinct captions matters more than the number of distinct images. This finding will guide our future investigation of how to improve the video description system by increasing the amount of training data.
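As an illustration of the kind of audio-visual fusion the abstract describes, the sketch below mean-pools per-frame visual features and per-segment audio features over time and concatenates them into a single vector that a caption decoder could consume. This is a minimal, hypothetical example: the feature dimensions and the pooling-plus-concatenation scheme are assumptions for illustration, not the paper's actual architecture, which uses unified convolutional and recurrent networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (illustrative only, not from the paper):
# per-frame visual CNN features and per-segment audio features.
visual = rng.standard_normal((30, 4096))  # 30 frames x 4096-d visual features
audio = rng.standard_normal((10, 128))    # 10 segments x 128-d audio features

def fuse_features(visual, audio):
    """Early fusion sketch: mean-pool each modality over time,
    then concatenate into one multimodal feature vector."""
    v = visual.mean(axis=0)   # (4096,)
    a = audio.mean(axis=0)    # (128,)
    return np.concatenate([v, a])

fused = fuse_features(visual, audio)
print(fused.shape)  # (4224,)
```

The fused vector would then initialize or condition a recurrent decoder that emits the description word by word.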


DOI: 10.21437/Interspeech.2016-380

Cite as

Jin, Q., Liang, J., Lin, X. (2016) Generating Natural Video Descriptions via Multimodal Processing. Proc. Interspeech 2016, 570-574.

Bibtex
@inproceedings{Jin+2016,
author={Qin Jin and Junwei Liang and Xiaozhu Lin},
title={Generating Natural Video Descriptions via Multimodal Processing},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-380},
url={http://dx.doi.org/10.21437/Interspeech.2016-380},
pages={570--574}
}