InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following

Li, Shufan; Singh, Harkanwar; Grover, Aditya

(Submitted on 11 Dec 2023 (v1), last revised 26 Apr 2024 (this version, v3))

Abstract: The ability to provide fine-grained control for generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have explored extending controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number and/or type of modality inputs used to express controllability. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that enables users to edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks that facilitate this capability: a multi-modal encoder that encodes different modalities such as images and audio into a unified latent space, a diffusion model that learns to decode representations in this latent space into images, and a multi-modal LLM that can understand instructions involving multiple images and audio pieces and generate a conditional embedding of the desired output, which can be used by the diffusion decoder. To improve training efficiency and generation quality, we also include a refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks. The code is available at this https URL
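
The data flow described in the abstract (encoder, multi-modal LLM, refinement prior, diffusion decoder) can be illustrated with a toy sketch. This is not the authors' code: every name here (`Input`, `encode`, `llm_predict`, `refine`, `decode`, `edit`, `DIM`) is hypothetical, and each function body is a trivial stand-in (a hash, a mean, a normalization) chosen only so that the example runs. It shows the order in which the components are composed, nothing about how they are actually implemented or trained.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import Dict, List

DIM = 8  # toy latent size (assumption; the abstract gives no dimensionality)


@dataclass
class Input:
    modality: str   # "image", "audio", or "text"
    payload: bytes  # raw content; opaque to this sketch


def encode(item: Input) -> List[float]:
    """Stand-in for the multi-modal encoder: map any input into one
    shared DIM-dimensional space (here via a deterministic hash)."""
    digest = hashlib.sha256(item.modality.encode() + b":" + item.payload).digest()
    return [b / 255.0 for b in digest[:DIM]]


def llm_predict(instruction: str, refs: List[List[float]]) -> List[float]:
    """Stand-in for the multi-modal LLM: combine the instruction with the
    referenced inputs into a single conditional embedding (here a mean)."""
    vecs = [encode(Input("text", instruction.encode()))] + refs
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def refine(embedding: List[float]) -> List[float]:
    """Stand-in for the refinement prior (here only L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in embedding)) or 1.0
    return [x / norm for x in embedding]


def decode(embedding: List[float]) -> Dict[str, object]:
    """Stand-in for the diffusion decoder: returns a description, not pixels."""
    return {"kind": "image", "conditioned_on": embedding}


def edit(instruction: str, inputs: List[Input]) -> Dict[str, object]:
    """Compose the four stages in the order the abstract lists them."""
    refs = [encode(i) for i in inputs]
    cond = refine(llm_predict(instruction, refs))
    return decode(cond)


result = edit(
    "make the image match the mood of the audio",
    [Input("image", b"source-image"), Input("audio", b"reference-clip")],
)
```

In this sketch the image and audio inputs pass through the same `encode` function, which mirrors the abstract's point that all modalities share one latent space, so the downstream stages do not need to know which modality a reference came from.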

Comments:	29 pages, 14 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.06738 [cs.CV]
	(or arXiv:2312.06738v3 [cs.CV] for this version)

Submission history

From: Shufan Li [view email]
[v1] Mon, 11 Dec 2023 17:53:45 GMT (21438kb,D)
[v2] Sat, 30 Dec 2023 23:04:37 GMT (22520kb,D)
[v3] Fri, 26 Apr 2024 05:52:31 GMT (38634kb,D)
