TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Ni, Haomiao; Egger, Bernhard; Lohit, Suhas; Cherian, Anoop; Wang, Ye; Koike-Akino, Toshiaki; Huang, Sharon X.; Marks, Tim K.

(Submitted on 25 Apr 2024)

Abstract: Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
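The autoregressive "repeat-and-slide" idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it only shows the control flow of filling a conditioning window with copies of the given image and then sliding that window as frames are produced. The names `repeat_and_slide`, `toy_generator`, and the `window` parameter are hypothetical, frames are plain lists of floats, and the frozen T2V diffusion model, the DDPM inversion, and the resampling step described above are all replaced by a trivial placeholder function.

```python
from collections import deque
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image or latent (assumption)


def repeat_and_slide(
    first_frame: Sequence[float],
    num_new_frames: int,
    window: int,
    generate_next: Callable[[List[Frame]], Frame],
) -> List[Frame]:
    """Toy autoregressive loop: repeat the given image, then slide the window."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # "Repeat": fill the conditioning window with copies of the provided image.
    queue = deque((list(first_frame) for _ in range(window)), maxlen=window)
    video = [list(first_frame)]

    for _ in range(num_new_frames):
        # Hypothetical call standing in for the frozen model's guided denoising.
        new_frame = generate_next(list(queue))
        video.append(new_frame)
        # "Slide": appending to a full deque with maxlen drops the oldest frame.
        queue.append(new_frame)

    return video


def toy_generator(context: List[Frame]) -> Frame:
    """Placeholder generator: per-element mean of the context, plus 1.0."""
    n = len(context)
    return [sum(f[i] for f in context) / n + 1.0 for i in range(len(context[0]))]


video = repeat_and_slide([0.0, 10.0], num_new_frames=3, window=2,
                         generate_next=toy_generator)
print(len(video))
```

Because each new frame depends only on the current window, the loop can in principle run for any number of steps, which mirrors the abstract's remark that the autoregressive design supports long video generation.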

Comments:	CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2404.16306 [cs.CV]
	(or arXiv:2404.16306v1 [cs.CV] for this version)

Submission history

From: Haomiao Ni
[v1] Thu, 25 Apr 2024 03:21:11 GMT (1693kb,D)
