We gratefully acknowledge support from
the Simons Foundation and member institutions.
Full-text links:

Download:

Current browse context:

cs.CV

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

(what is this?)
CiteULike logo BibSonomy logo Mendeley logo del.icio.us logo Digg logo Reddit logo

Computer Science > Computer Vision and Pattern Recognition

Title: Scalable Diffusion Models with State Space Backbone

Abstract: This paper presents a new exploration into a category of diffusion models built upon state space architecture. We endeavor to train diffusion models for image data, wherein the traditional U-Net backbone is supplanted by a state space backbone, functioning on raw patches or latent space. Given its notable efficacy in accommodating long-range dependencies, Diffusion State Space Models (DiS) are distinguished by treating all inputs including time, condition, and noisy image patches as tokens. Our assessment of DiS encompasses both unconditional and class-conditional image generation scenarios, revealing that DiS exhibits comparable, if not superior, performance to CNN-based or Transformer-based U-Net architectures of commensurate size. Furthermore, we analyze the scalability of DiS, gauged by the forward pass complexity quantified in Gflops. DiS models with higher Gflops, achieved through augmentation of depth/width or augmentation of input tokens, consistently demonstrate lower FID. In addition to demonstrating commendable scalability characteristics, DiS-H/2 models in latent space achieve performance levels akin to prior diffusion models on class-conditional ImageNet benchmarks at the resolution of 256$\times$256 and 512$\times$512, while significantly reducing the computational burden. The code and models are available at: this https URL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as: arXiv:2402.05608 [cs.CV]
  (or arXiv:2402.05608v3 [cs.CV] for this version)

Submission history

From: Zhengcong Fei [view email]
[v1] Thu, 8 Feb 2024 12:08:42 GMT (6364kb,D)
[v2] Sun, 25 Feb 2024 03:06:17 GMT (6351kb,D)
[v3] Thu, 28 Mar 2024 08:28:44 GMT (6554kb,D)

Link back to: arXiv, form interface, contact.