Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation

Wu, Xun; Huang, Shaohan; Wei, Furu

(Submitted on 23 Apr 2024)

Abstract: Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or lack diversity in preference dimensions, which limits their applicability for instruction tuning in open-source text-to-image generative models and hinders further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. To construct VisionPrefer, we aggregate feedback from AI annotators across four aspects: prompt-following, aesthetic, fidelity, and harmlessness. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to perform supervised fine-tuning of generative models in order to evaluate the performance of VisionPrefer. Extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetic, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that the integration of AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
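
To make the two quantities in the abstract concrete, the following is a minimal Python sketch of (a) collapsing per-aspect annotator scores into a pairwise preference label and (b) measuring preference prediction accuracy against reference labels. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the four aspects are combined, so the unweighted mean below is an assumption, and the names `aggregate_score`, `to_preference_label`, and `preference_accuracy` are hypothetical.

```python
from statistics import mean

# The four preference aspects named in the abstract.
ASPECTS = ("prompt_following", "aesthetic", "fidelity", "harmlessness")


def aggregate_score(aspect_scores):
    """Collapse per-aspect scores for one image into a single scalar.

    Assumption: an unweighted mean. The abstract does not specify the
    aggregation rule, so this is only a placeholder.
    """
    return mean(aspect_scores[aspect] for aspect in ASPECTS)


def to_preference_label(scores_a, scores_b):
    """Return 0 if image A is preferred, 1 if image B is, None on a tie."""
    a = aggregate_score(scores_a)
    b = aggregate_score(scores_b)
    if a == b:
        return None
    return 0 if a > b else 1


def preference_accuracy(predicted, reference):
    """Fraction of image pairs whose predicted label matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one pair is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(predicted)


if __name__ == "__main__":
    # Made-up scores on an arbitrary 1-5 scale, for illustration only.
    image_a = {"prompt_following": 4, "aesthetic": 3, "fidelity": 4, "harmlessness": 5}
    image_b = {"prompt_following": 2, "aesthetic": 3, "fidelity": 3, "harmlessness": 5}
    print(to_preference_label(image_a, image_b))  # 0 (image A preferred)

    # Three of the four predicted labels agree with the reference labels.
    print(preference_accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

In this sketch the reference labels stand in for human choices; a reward model such as VP-Score would supply the predicted labels by scoring each image in a pair and picking the higher-scored one.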

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2404.15100 [cs.CV]
	(or arXiv:2404.15100v1 [cs.CV] for this version)

Submission history

From: Shaohan Huang
[v1] Tue, 23 Apr 2024 14:53:15 GMT (12801kb,D)
