LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- URL: http://arxiv.org/abs/2310.10769v1
- Date: Mon, 16 Oct 2023 19:03:19 GMT
- Title: LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation
- Authors: Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, Xiangyu
Zhang
- Abstract summary: We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers.
- Score: 44.220329202024494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the impressive progress in diffusion-based text-to-image generation,
extending such powerful generative ability to text-to-video has attracted enormous
attention. Existing methods either require large-scale text-video pairs and
substantial training resources, or learn motions that are precisely aligned
with template videos. Balancing the trade-off between the degree of generation
freedom and the resource cost is non-trivial for video generation. In
our study, we present a few-shot-based tuning framework, LAMP, which enables
a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos
on a single GPU. Specifically, we design a first-frame-conditioned pipeline
that uses an off-the-shelf text-to-image model for content generation, so that
our tuned video diffusion model mainly focuses on motion learning. The
well-developed text-to-image techniques can provide visually pleasing and
diverse content as generation conditions, which greatly improves video quality
and generation freedom. To capture features along the temporal dimension, we
expand the pretrained 2D convolution layers of the T2I model into our novel
temporal-spatial motion learning layers and extend the attention blocks to the
temporal level.
Additionally, we develop an effective inference trick,
shared-noise sampling, which improves the stability of generated videos at
negligible computational cost. Our method can also be flexibly applied to other tasks,
e.g., real-world image animation and video editing. Extensive experiments
demonstrate that LAMP can effectively learn the motion pattern from limited data
and generate high-quality videos. The code and models are available at
https://rq-wu.github.io/projects/LAMP.
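For concreteness, here is a minimal PyTorch sketch of the two mechanisms the abstract names: expanding a pretrained 2D convolution of the T2I model into a temporal-spatial layer, and shared-noise sampling at inference. The class name, the identity initialization, and the blending weight alpha are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TemporalSpatialConv(nn.Module):
    """Hypothetical sketch of inflating a pretrained 2D (spatial) convolution
    with an extra temporal 1D convolution. The class name and identity
    initialization are illustrative assumptions, not LAMP's exact layer."""

    def __init__(self, pretrained_conv2d: nn.Conv2d):
        super().__init__()
        self.spatial = pretrained_conv2d  # reused weights from the T2I model
        channels = pretrained_conv2d.out_channels
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)  # start as an identity mapping
        nn.init.zeros_(self.temporal.bias)    # so the T2I behaviour is preserved

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w))             # per-frame 2D conv
        c2, h2, w2 = y.shape[1:]
        y = y.reshape(b, f, c2, h2, w2).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        y = self.temporal(y.reshape(b * h2 * w2, c2, f))        # mix across frames
        return y.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)


def shared_noise_sampling(b, f, c, h, w, alpha=0.5):
    """One plausible reading of 'shared-noise sampling': blend noise shared by
    all frames with per-frame noise (the blending weight alpha is assumed)."""
    shared = torch.randn(b, 1, c, h, w).expand(b, f, c, h, w)
    per_frame = torch.randn(b, f, c, h, w)
    noise = alpha * shared + (1.0 - alpha) * per_frame
    return noise / noise.std()  # roughly restore unit variance


if __name__ == "__main__":
    layer = TemporalSpatialConv(nn.Conv2d(4, 4, kernel_size=3, padding=1))
    video_latents = shared_noise_sampling(b=1, f=16, c=4, h=64, w=64)
    print(layer(video_latents).shape)  # torch.Size([1, 16, 4, 64, 64])
```

With the identity initialization, the inflated layer initially reproduces the frame-wise T2I behaviour, so few-shot tuning mainly has to learn how information should flow across frames.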
Related papers
- EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture [11.587428534308945]
EasyAnimate is an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes.
We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block.
We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training, and end-to-end video inference.
arXiv Detail & Related papers (2024-05-29T11:11:07Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
arXiv Detail & Related papers (2023-12-06T13:39:35Z) - BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free, general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet or Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model (a minimal data-flow sketch of these three stages is given after the related papers).
arXiv Detail & Related papers (2023-12-05T14:56:55Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary
Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video.
The step-by-step learning process helps stabilize training and enables the creation of high-resolution videos from conditional text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)
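As referenced in the BIVDiff entry above, its summary names three stages: frame-wise generation with an image diffusion model, Mixed Inversion, and denoising with a video diffusion model. The skeleton below only sketches that data flow with placeholder callables; the function names, tensor shapes, and the noise-adding stand-in for Mixed Inversion are assumptions for illustration, not the paper's implementation.

```python
import torch

# Placeholder callables marking where real models would plug in. Latents are
# (frames, channels, height, width); everything here is a hypothetical
# stand-in, not BIVDiff's actual interface.

def frame_wise_image_diffusion(prompt: str, controls: torch.Tensor) -> torch.Tensor:
    """Stage 1 (stand-in): generate each frame independently with an image
    diffusion model (e.g. a ControlNet- or Instruct-Pix2Pix-style model)."""
    f, _, h, w = controls.shape
    return torch.randn(f, 4, h // 8, w // 8)  # dummy frame latents for the sketch

def mixed_inversion(frame_latents: torch.Tensor) -> torch.Tensor:
    """Stage 2 (stand-in): invert the generated frames back to noisy latents.
    Mixed Inversion is the paper's own technique; adding noise is used here
    purely as a placeholder."""
    return frame_latents + torch.randn_like(frame_latents)

def video_diffusion_denoise(prompt: str, noisy_latents: torch.Tensor) -> torch.Tensor:
    """Stage 3 (stand-in): denoise the inverted latents jointly with a
    text-to-video foundation diffusion model for temporal consistency."""
    return noisy_latents  # a real model would run the reverse diffusion here

def bridge_image_and_video_models(prompt: str, controls: torch.Tensor) -> torch.Tensor:
    """Chain the three stages in the order the summary describes."""
    frames = frame_wise_image_diffusion(prompt, controls)
    inverted = mixed_inversion(frames)
    return video_diffusion_denoise(prompt, inverted)

if __name__ == "__main__":
    controls = torch.randn(16, 3, 512, 512)  # e.g. 16 per-frame depth/edge maps
    video_latents = bridge_image_and_video_models("a cat surfing", controls)
    print(video_latents.shape)  # torch.Size([16, 4, 64, 64])
```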