Learned representation-guided diffusion models for large-image generation

Graikos, Alexandros; Yellapragada, Srikar; Le, Minh-Quan; Kapse, Saarthak; Prasanna, Prateek; Saltz, Joel; Samaras, Dimitris

(Submitted on 12 Dec 2023 (v1), last revised 28 Mar 2024 (this version, v2))

Abstract: To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, procuring the painstaking patch-level annotations required in specialized domains like histopathology and satellite imagery is impractical: annotation is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies for fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g., class labels, text, genomic data). As a proof of concept, we introduce the text-to-large image synthesis paradigm, in which we successfully synthesize large pathology and satellite images from text descriptions.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.07330 [cs.CV]
	(or arXiv:2312.07330v2 [cs.CV] for this version)

Submission history

From: Alexandros Graikos
[v1] Tue, 12 Dec 2023 14:45:45 GMT (33743kb,D)
[v2] Thu, 28 Mar 2024 17:07:38 GMT (33481kb,D)
