ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models

Madasu, Avinash; Lal, Vasudev

Computer Science > Computer Vision and Pattern Recognition

(Submitted on 28 Jun 2023 (v1), last revised 17 Apr 2024 (this version, v2))

Abstract: Video retrieval (VR) involves retrieving the ground-truth video from a video database given a text caption, or vice versa. The two important components of compositionality, objects & attributes and actions, are joined using correct syntax to form a proper text query. These components (objects & attributes, actions, and syntax) each play an important role in distinguishing among videos and retrieving the correct ground-truth video. However, it is unclear what effect each of these components has on video retrieval performance. We therefore conduct a systematic study to evaluate the compositional and syntactic understanding of video retrieval models on standard benchmarks such as MSRVTT, MSVD, and DIDEMO. The study is performed on two categories of video retrieval models: (i) models that are pre-trained on video-text pairs and fine-tuned on downstream video retrieval datasets (e.g., Frozen-in-Time, Violet, MCQ) and (ii) models that adapt pre-trained image-text representations such as CLIP for video retrieval (e.g., CLIP4Clip, XCLIP, CLIP2Video). Our experiments reveal that actions and syntax play a minor role compared to objects & attributes in video understanding. Moreover, video retrieval models that use pre-trained image-text representations (CLIP) have better syntactic and compositional understanding than models pre-trained on video-text data. The code is available at this https URL
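
Illustrative sketch (not the authors' code): text-to-video retrieval of the kind described in the abstract is commonly scored with Recall@K over a caption-by-video similarity matrix, and a syntax probe can be approximated by permuting the words of a caption while keeping the words themselves. The function names `recall_at_k` and `shuffle_words`, the assumption that caption i is paired with video i, and the toy scores are all hypothetical. The abstract does not specify which perturbations the study actually uses.

```python
import random


def recall_at_k(similarity, k):
    """Fraction of captions whose ground-truth video ranks in the top k.

    similarity[i][j] is the score between caption i and video j.
    Assumption: the ground-truth video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank video indices by descending similarity to caption i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def shuffle_words(caption, seed=0):
    """Hypothetical syntax probe: permute word order, keep the same words."""
    words = caption.split()
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


# Toy scores, made up for illustration only (not results from the paper).
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

if __name__ == "__main__":
    print("R@1:", recall_at_k(toy_similarity, 1))
    print("R@2:", recall_at_k(toy_similarity, 2))
    print("Perturbed caption:", shuffle_words("a man is playing a guitar"))
```

Under these assumptions, comparing Recall@K on original captions with Recall@K on perturbed captions would give a rough measure of how much a model relies on word order.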

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2306.16533 [cs.CV]
	(or arXiv:2306.16533v2 [cs.CV] for this version)

Submission history

From: Avinash Madasu
[v1] Wed, 28 Jun 2023 20:06:36 GMT (110kb,D)
[v2] Wed, 17 Apr 2024 11:38:12 GMT (115kb,D)
