Retrieval Enhanced Zero-Shot Video Captioning

Ma, Yunchuan; Qing, Laiyun; Li, Guorong; Qi, Yuankai; Sheng, Quan Z.; Huang, Qingming

(Submitted on 11 May 2024)

Abstract: Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using three key models, chosen because their source code is available: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2. The main challenge is how to make the text generation model sufficiently aware of the content of a given video so that it generates corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Unlike the conventional approach of training these tokens on training data, we update them with pseudo-targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information tailored to GPT-2. This procedure takes just a few iterations (we use 16 iterations in the experiments) and does not require ground-truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
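
The test-time adaptation idea in the abstract (frozen models, with only a small set of learnable tokens updated for 16 iterations on the inference input) can be illustrated with a toy sketch. This is not the authors' implementation: the frozen matrix `FROZEN_W` stands in for a frozen text-side model, `VIDEO_EMB` stands in for a frozen video encoder's output, the cosine loss replaces the paper's loss functions and pseudo-targets, and the step size `LR` is an arbitrary choice. All of these names and values are assumptions made for illustration.

```python
import math

STEPS = 16  # iteration count quoted in the abstract
LR = 0.5    # hypothetical step size, not taken from the paper

# Hypothetical frozen components: never modified during adaptation.
FROZEN_W = ((1.0, 0.0, 0.5), (0.0, 1.0, 0.0), (0.5, 0.0, 1.0))  # toy "text model"
VIDEO_EMB = (0.0, 1.0, 0.0)                                      # toy "video embedding"


def matvec(matrix, vector):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in vector))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def adapt(tokens, steps=STEPS, lr=LR):
    """Update only `tokens` so that FROZEN_W @ tokens aligns with VIDEO_EMB.

    Returns the adapted tokens and the loss (1 - cosine) recorded before
    each step plus once after the last step.
    """
    tokens = list(tokens)
    w_transposed = [list(col) for col in zip(*FROZEN_W)]
    losses = []
    for _ in range(steps):
        text = matvec(FROZEN_W, tokens)          # forward through the frozen map
        cos = cosine(text, VIDEO_EMB)
        losses.append(1.0 - cos)
        n_t, n_v = norm(text), norm(VIDEO_EMB)
        # Gradient of cosine similarity with respect to the text embedding.
        grad_text = [v / (n_t * n_v) - cos * t / (n_t * n_t)
                     for t, v in zip(text, VIDEO_EMB)]
        # Chain rule back to the tokens; the frozen weights get no update.
        grad_tokens = matvec(w_transposed, grad_text)
        # Gradient ascent on similarity, i.e. descent on the loss.
        tokens = [p + lr * g for p, g in zip(tokens, grad_tokens)]
    losses.append(1.0 - cosine(matvec(FROZEN_W, tokens), VIDEO_EMB))
    return tokens, losses
```

Calling `adapt([1.0, 0.0, 0.0])` starts from tokens whose mapped embedding is orthogonal to the toy video embedding and returns the adapted tokens together with the loss trace. The point of the sketch is only the division of roles: the "models" stay fixed while a handful of parameters is optimized per input, without any ground-truth caption.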

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2405.07046 [cs.CV]
	(or arXiv:2405.07046v1 [cs.CV] for this version)

Submission history

From: Yunchuan Ma
[v1] Sat, 11 May 2024 16:22:00 GMT (978kb,D)
