Computer Science > Computer Vision and Pattern Recognition
Title: What Makes Multimodal In-Context Learning Work?
(Submitted on 24 Apr 2024 (v1), last revised 25 Apr 2024 (this version, v2))
Abstract: Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, such as through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (such as RICES), M-ICL is no better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. Code available at this https URL
Submission history
From: Folco Bertini Baldassini
[v1] Wed, 24 Apr 2024 08:50:45 GMT (2195kb,D)
[v2] Thu, 25 Apr 2024 06:04:16 GMT (2195kb,D)