Title: Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Abstract: Instruction-tuning data is essential for training Multimodal Large Language Models (MLLMs), but creating high-quality instruction-tuning data is challenging. Having humans label it is labor-intensive and time-consuming. Prior approaches that prompt GPT-4 to generate data are costly and perform unsatisfactorily on complex tasks (i.e., grounding-based reasoning tasks). To address these challenges, we are the first to explore empowering MLLMs to generate instruction-tuning data by following user instructions. Specifically, we develop Genixer, a data generation pipeline that produces high-quality instruction-tuning data covering nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Genixer provides a unified solution for data generation with four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering the MLLM, and (iv) data generation and filtering. To assess the quality of the generated data, we conducted a human evaluation and a user preference study. We then generated two instruction-tuning datasets to train two representative MLLMs, LLaVA1.5 and Shikra, and observed consistent improvements across various VQA tasks and multimodal benchmarks. For instance, performance on the VizWiz benchmark improved from 50.0% to 53.8%, and on ScienceQA from 66.8% to 69.7%, further confirming the quality of the generated instruction-tuning data. The data, code, and models will be released.
Comments: Technical report
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as: arXiv:2312.06731 [cs.CV]
  (or arXiv:2312.06731v4 [cs.CV] for this version)
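
The four steps named in the abstract can be pictured with a small sketch. This is only an illustration of the overall shape, not the authors' code: every name here (`Sample`, `TEMPLATES`, `run_pipeline`, `toy_model`) is hypothetical, the template wording is invented, the trained MLLM is replaced by a plain callable, and the filter is a trivial non-empty check because the abstract does not say how filtering is done.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    """One generated instruction-tuning example (hypothetical layout)."""
    image_id: str
    task: str
    question: str
    answer: str


# Step (ii): one instruction template per task. The wording is invented
# for illustration; the abstract does not give the real templates.
TEMPLATES = {
    "Common VQA": "Generate a question-answer pair about this image.",
    "REC": "Generate a REC sample for this image.",
}

# Stand-in type for the data-generating MLLM: (image_id, prompt) -> (question, answer).
Generator = Callable[[str, str], Tuple[str, str]]


def keep(sample: Sample) -> bool:
    """Step (iv), filtering: placeholder rule that drops empty outputs."""
    return bool(sample.question.strip()) and bool(sample.answer.strip())


def run_pipeline(image_ids: Iterable[str], task: str, model: Generator) -> List[Sample]:
    """Step (iv): generate one sample per image, then filter.

    Steps (i) and (iii), collecting instruction data and training the MLLM
    on it, are assumed to have happened already and to have produced `model`.
    """
    prompt = TEMPLATES[task]
    generated = [Sample(i, task, *model(i, prompt)) for i in image_ids]
    return [s for s in generated if keep(s)]


def toy_model(image_id: str, prompt: str) -> Tuple[str, str]:
    """Toy generator used only to exercise the sketch."""
    answer = "a cat" if image_id == "img1" else ""
    return ("What is shown?", answer)


if __name__ == "__main__":
    for sample in run_pipeline(["img1", "img2"], "Common VQA", toy_model):
        print(sample)
```

With the toy generator, the sample for `img2` has an empty answer and is dropped by the placeholder filter, so only the `img1` sample survives.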

Submission history

From: Zhao Hengyuan
[v1] Mon, 11 Dec 2023 09:44:41 GMT (6973kb,D)
[v2] Tue, 19 Mar 2024 09:13:22 GMT (8334kb,D)
[v3] Wed, 20 Mar 2024 07:00:39 GMT (8335kb,D)
[v4] Wed, 24 Apr 2024 07:05:11 GMT (8323kb,D)
