Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

Zhao, Henry Hengyuan; Zhou, Pan; Shou, Mike Zheng

(Submitted on 11 Dec 2023 (v1), last revised 24 Apr 2024 (this version, v4))

Abstract: Instruction tuning data is essential for training Multimodal Large Language Models (MLLMs). However, creating high-quality instruction tuning data presents significant challenges. Asking humans to label instruction tuning data is labor-intensive and time-consuming. Some works that prompt GPT-4 for data generation are not only costly but also lack satisfactory performance on complex tasks (i.e., grounding-based reasoning tasks). To address the challenges of data creation, we are the first to explore the potential of empowering MLLMs with the ability to generate instruction-tuning data by following user instructions. Specifically, we developed an innovative data generation pipeline, Genixer, to generate various high-quality instruction tuning data, covering nine representative tasks, e.g., Common VQA, REC, REG, and PointQ. Genixer provides a unified solution for data generation with four key steps: (i) instruction data collection, (ii) instruction template design, (iii) empowering the MLLM, and (iv) data generation and filtering. To validate the effectiveness of the generated data, we conducted a human evaluation and a user preference study to assess its quality. Subsequently, we generated two instruction-tuning datasets for training two representative MLLMs, LLaVA1.5 and Shikra, and observed consistent improvements across various VQA tasks and multimodal benchmarks. For instance, performance on the VizWiz benchmark improved from 50.0% to 53.8%, and on ScienceQA it increased from 66.8% to 69.7%, reconfirming the quality of the generated instruction tuning data. The data, code, and models will be released.
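The four steps named in the abstract can be illustrated with a small sketch. This is not the authors' code: the template wording, the `stub_generator` stand-in for a trained MLLM, and the rule-based `keep` filter are all assumptions made for illustration, and steps (i) and (iii) are only indicated in comments because they involve dataset collection and model training.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Sample:
    task: str
    image_id: str
    question: str
    answer: str


# Step (i), instruction data collection, is assumed to have happened already:
# existing datasets would supply (task, image, question, answer) examples.

# Step (ii): a hypothetical instruction template keyed by task name.
TEMPLATE = "Task: {task}. Given the image, write one question and its answer."


def build_prompt(task: str) -> str:
    return TEMPLATE.format(task=task)


# Step (iii), empowering the MLLM, would train a model on templated data.
# This stub only mimics the interface of such a model (an assumption).
def stub_generator(prompt: str, image_id: str) -> Tuple[str, str]:
    if image_id.endswith("_bad"):
        return "", ""  # simulate a failed generation
    return f"What is shown in {image_id}?", "an object"


# Step (iv): generate, then filter. The rule below is a placeholder
# for whatever filtering the actual pipeline applies.
def keep(sample: Sample) -> bool:
    return sample.question.strip().endswith("?") and bool(sample.answer.strip())


def generate(
    task: str,
    image_ids: Iterable[str],
    generator: Callable[[str, str], Tuple[str, str]],
) -> List[Sample]:
    prompt = build_prompt(task)
    candidates = []
    for image_id in image_ids:
        question, answer = generator(prompt, image_id)
        candidates.append(Sample(task, image_id, question, answer))
    return [s for s in candidates if keep(s)]


if __name__ == "__main__":
    for sample in generate("Common VQA", ["img1", "img2_bad"], stub_generator):
        print(sample)
```

Passing the generator as a callable keeps the sketch independent of any particular model; a real MLLM wrapper with the same two-argument interface could be substituted for the stub.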

Comments:	Technical report
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2312.06731 [cs.CV]
	(or arXiv:2312.06731v4 [cs.CV] for this version)

Submission history

From: Zhao Hengyuan
[v1] Mon, 11 Dec 2023 09:44:41 GMT (6973kb,D)
[v2] Tue, 19 Mar 2024 09:13:22 GMT (8334kb,D)
[v3] Wed, 20 Mar 2024 07:00:39 GMT (8335kb,D)
[v4] Wed, 24 Apr 2024 07:05:11 GMT (8323kb,D)
