AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
- URL: http://arxiv.org/abs/2312.03793v1
- Date: Wed, 6 Dec 2023 13:39:35 GMT
- Title: AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
- Authors: Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang
- Abstract summary: We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
- Score: 63.938509879469024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale text-to-video (T2V) diffusion models have made great progress in
recent years in terms of visual quality, motion and temporal consistency.
However, the generation process is still a black box, where all attributes
(e.g., appearance, motion) are learned and generated jointly without precise
control ability other than rough text descriptions. Inspired by image animation
which decouples the video as one specific appearance with the corresponding
motion, we propose AnimateZero to unveil the pre-trained text-to-video
diffusion model, i.e., AnimateDiff, and provide more precise appearance and
motion control abilities for it. For appearance control, we borrow intermediate
latents and their features from the text-to-image (T2I) generation to ensure
that the generated first frame is equal to the given generated image. For temporal
control, we replace the global temporal attention of the original T2V model
with our proposed positional-corrected window attention to ensure other frames
align with the first frame well. Empowered by the proposed methods, AnimateZero
can successfully control the generation process without further training. As a
zero-shot image animator for given images, AnimateZero also enables multiple
new applications, including interactive video generation and real image
animation. The detailed experiments demonstrate the effectiveness of the
proposed method in both T2V and related applications.
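The temporal-control idea in the abstract (replacing global temporal attention with a windowed variant so that other frames align with the first frame) can be illustrated with a toy example. The sketch below is not the paper's positional-corrected window attention: it assumes a much simpler causal window in which frame i attends only to frames 0..i, and it omits positional embeddings entirely. The function name `temporal_attention` and the flag `causal_window` are hypothetical, chosen for illustration only.

```python
import math

def temporal_attention(q, k, v, causal_window=False):
    """Toy attention over the frame axis for one spatial location.

    q, k, v: lists of per-frame feature vectors, shape (num_frames, dim).
    causal_window: ASSUMED simplification, not the paper's method.
        When True, frame i attends only to frames 0..i.
    """
    num_frames, dim = len(q), len(q[0])
    out = []
    for i in range(num_frames):
        # Global attention sees every frame; the window stops at frame i.
        last = i if causal_window else num_frames - 1
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(last + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append(
            [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        )
    return out

# Three frames with two-dimensional features (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_global = temporal_attention(frames, frames, frames)
out_window = temporal_attention(frames, frames, frames, causal_window=True)
```

Under this assumed window, the first frame attends only to itself, so its output equals its own value vector and is unaffected by later frames, whereas global attention mixes later frames into it. How the actual method builds its window and corrects positions is described in the paper, not here.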
Related papers
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
- Zero-shot High-fidelity and Pose-controllable Character Animation [89.74818983864832]
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z)
- LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [44.18315132571804]
We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models.
Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
arXiv Detail & Related papers (2023-12-12T10:07:37Z)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z)
- AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance [13.416296247896042]
We introduce an open domain image animation method that leverages the motion prior of video diffusion model.
Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control of the movable area and its motion speed.
We validate the effectiveness of our method through rigorous experiments on an open-domain dataset.
arXiv Detail & Related papers (2023-11-21T03:47:54Z)
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8 to 16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers.
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [92.33690050667475]
AnimateDiff is a framework for animating personalized T2I models without requiring model-specific tuning.
We propose MotionLoRA, a lightweight fine-tuning technique for AnimateDiff that enables a pre-trained motion module to adapt to new motion patterns.
Results show that our approaches help these models generate temporally smooth animation clips while preserving the visual quality and motion diversity.
arXiv Detail & Related papers (2023-07-10T17:34:16Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.