What Makes Multimodal In-Context Learning Work?

Baldassini, Folco Bertini; Shukor, Mustafa; Cord, Matthieu; Soulier, Laure; Piwowarski, Benjamin

(Submitted on 24 Apr 2024 (v1), last revised 25 Apr 2024 (this version, v2))

Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at this https URL
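
The majority-voting baseline mentioned in finding (2) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that a RICES-style strategy retrieves the k demonstrations whose embeddings are most similar to the query, and that the baseline simply returns the most frequent label among those demonstrations without calling the model at all. The function names (`cosine`, `retrieve_top_k`, `majority_vote`) and the toy two-dimensional embeddings are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_vec, pool, k):
    """Return the k (embedding, label) pairs from pool most similar to query_vec.

    Assumption: this stands in for similarity-based demonstration retrieval;
    a real system would use embeddings from a pretrained encoder.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return ranked[:k]


def majority_vote(demos):
    """Return the most frequent label among the retrieved demonstrations."""
    counts = Counter(label for _, label in demos)
    return counts.most_common(1)[0][0]


# Toy support set: (embedding, label) pairs, purely illustrative.
pool = [
    ([1.0, 0.1], "cat"),
    ([0.9, 0.2], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.8, 0.3], "dog"),
]

prediction = majority_vote(retrieve_top_k([1.0, 0.0], pool, k=3))
print(prediction)
```

Under these assumptions, the comparison described in the abstract amounts to checking whether the model's answer, given the same retrieved demonstrations in its context, beats `prediction`.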

Comments:	20 pages, 16 figures. Accepted to CVPR 2024 Workshop on Prompting in Vision. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2404.15736 [cs.CV]
	(or arXiv:2404.15736v2 [cs.CV] for this version)

Submission history

From: Folco Bertini Baldassini
[v1] Wed, 24 Apr 2024 08:50:45 GMT (2195kb,D)
[v2] Thu, 25 Apr 2024 06:04:16 GMT (2195kb,D)
