Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation
- URL: http://arxiv.org/abs/2212.11565v1
- Date: Thu, 22 Dec 2022 09:43:36 GMT
- Title: Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation
- Authors: Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Wynne
Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou
- Abstract summary: To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ large-scale text-video datasets for fine-tuning.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
- Score: 31.882356164068753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To reproduce the success of text-to-image (T2I) generation, recent works in
text-to-video (T2V) generation employ large-scale text-video datasets for
fine-tuning. However, such a paradigm is computationally expensive. Humans have
the amazing ability to learn new visual concepts from just a single exemplar.
We hereby study a new T2V generation problem: One-Shot Video Generation, where
only a single text-video pair is presented for training an
open-domain T2V generator. Intuitively, we propose to adapt the T2I diffusion
model pretrained on massive image data for T2V generation. We make two key
observations: 1) T2I models are able to generate images that align well with
the verb terms; 2) extending T2I models to generate multiple images
concurrently exhibits surprisingly good content consistency. To further learn
continuous motion, we propose Tune-A-Video with a tailored Sparse-Causal
Attention, which generates videos from text prompts via an efficient one-shot
tuning of pretrained T2I diffusion models. Tune-A-Video is capable of producing
temporally coherent videos across various applications such as changing the
subject or background, attribute editing, and style transfer, demonstrating the
versatility and effectiveness of our method.
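The sparse-causal mechanism described above can be pictured as a self-attention layer in which each frame's queries attend only to the spatial tokens of the first frame and the immediately preceding frame. The PyTorch sketch below illustrates that key/value selection under our own assumptions about tensor layout; module and variable names are illustrative and not taken from the authors' implementation.

```python
# Minimal sketch of a sparse-causal attention pattern: queries come from the
# current frame; keys/values come only from the first and the previous frame.
import torch
import torch.nn.functional as F
from torch import nn


class SparseCausalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- spatial tokens of each video frame
        b, f, n, d = x.shape
        out = []
        for i in range(f):
            q = self.to_q(x[:, i])
            # sparse causal context: first frame + previous frame (frame 0 twice when i == 0)
            ctx = torch.cat([x[:, 0], x[:, max(i - 1, 0)]], dim=1)
            k, v = self.to_k(ctx), self.to_v(ctx)
            q, k, v = (t.reshape(b, -1, self.heads, d // self.heads).transpose(1, 2)
                       for t in (q, k, v))
            attn = F.scaled_dot_product_attention(q, k, v)
            out.append(attn.transpose(1, 2).reshape(b, n, d))
        return self.to_out(torch.stack(out, dim=1))
```

In the full method such a layer would stand in for the frame-wise self-attention of an inflated T2I U-Net before the one-shot fine-tuning step; the sketch only shows the attention pattern itself.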
Related papers
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation [22.782099757385804]
TIP-I2V is the first large-scale dataset of user-provided text and image prompts for image-to-video generation.
We provide the corresponding generated videos from five state-of-the-art image-to-video models.
arXiv Detail & Related papers (2024-11-05T18:52:43Z)
- Still-Moving: Customized Video Generation without Customized Video Data [81.09302547183155]
We introduce Still-Moving, a novel framework for customizing a text-to-video (T2V) model.
The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model.
We train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers.
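The Spatial Adapters above are described only at a high level here; a common realization of such a lightweight adapter is a zero-initialized bottleneck that residually adjusts a frozen layer's output. The sketch below follows that generic recipe and is not Still-Moving's actual architecture.

```python
import torch
from torch import nn


class SpatialAdapter(nn.Module):
    """Generic bottleneck adapter applied to features from a frozen T2I layer.
    Zero-initialized so training starts from the unmodified pretrained output."""

    def __init__(self, channels: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.act = nn.SiLU()
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch * frames, channels, height, width)
        return feat + self.up(self.act(self.down(feat)))
```

Only the adapter weights would be passed to the optimizer; the injected T2I layers and the video backbone stay frozen.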
arXiv Detail & Related papers (2024-07-11T17:06:53Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models [94.25084162939488]
Text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment.
We introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I.
arXiv Detail & Related papers (2024-03-08T16:44:54Z)
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [97.5767036934979]
We introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models.
T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input.
Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality.
arXiv Detail & Related papers (2023-10-30T13:12:40Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into novel temporal-spatial motion learning layers.
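The entry above does not detail the layer design. A standard way to expand a pretrained 2D convolution for video is to keep it as the spatial branch and add a 1D temporal convolution initialized as identity; the sketch below follows that generic inflation recipe and is not LAMP's exact layer.

```python
import torch
from torch import nn


class TemporalSpatialConv(nn.Module):
    """Wraps a pretrained 2D conv with an extra 1D temporal conv.
    A generic inflation sketch, not LAMP's exact motion learning layer."""

    def __init__(self, conv2d: nn.Conv2d, kernel_t: int = 3):
        super().__init__()
        self.spatial = conv2d  # pretrained weights, reused as-is
        c = conv2d.out_channels
        self.temporal = nn.Conv1d(c, c, kernel_t, padding=kernel_t // 2)
        # identity init so the pretrained T2I behavior is preserved at the start
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor, frames: int) -> torch.Tensor:
        # x: (batch * frames, c_in, h, w)
        x = self.spatial(x)
        bf, c, h, w = x.shape
        b = bf // frames
        # fold spatial positions into the batch, convolve over the frame axis, restore layout
        x = x.reshape(b, frames, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, frames)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, frames).permute(0, 4, 3, 1, 2).reshape(bf, c, h, w)
        return x
```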
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
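The parameter-efficient recipe summarized above (train roughly 24M adapter parameters while the ~1.1B-parameter T2I backbone stays frozen) can be sketched generically as follows; the helper names and the "adapter" naming convention are assumptions, not SimDA's code.

```python
import torch
from torch import nn


def freeze_all_but_adapters(model: nn.Module, adapter_key: str = "adapter") -> None:
    """Freeze every parameter whose name does not contain `adapter_key`.
    Generic parameter-efficient tuning recipe; the key is an assumed convention."""
    for name, param in model.named_parameters():
        param.requires_grad_(adapter_key in name)


def trainable_fraction(model: nn.Module) -> float:
    # e.g. roughly 24M / 1.1B (about 2%) in the setting described above
    total = sum(p.numel() for p in model.parameters())
    tuned = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return tuned / total
```

The optimizer would then receive only `[p for p in model.parameters() if p.requires_grad]`, so gradient state is kept just for the adapter weights.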
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) generation.
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.