UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
- URL: http://arxiv.org/abs/2504.11289v1
- Date: Tue, 15 Apr 2025 15:29:11 GMT
- Title: UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
- Authors: Xiang Wang, Shiwei Zhang, Longxiang Tang, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang,
- Abstract summary: UniAnimate-DiT is an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation.<n>We implement Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead.<n> Experimental results show that our approach achieves visually appearing and temporally consistent high-fidelity animations.
- Score: 45.51168344933782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents UniAnimate-DiT, an advanced project that leverages the cutting-edge and powerful capabilities of the open-source Wan2.1 model for consistent human image animation. Specifically, to preserve the robust generative capabilities of the original Wan2.1 model, we implement Low-Rank Adaptation (LoRA) technique to fine-tune a minimal set of parameters, significantly reducing training memory overhead. A lightweight pose encoder consisting of multiple stacked 3D convolutional layers is designed to encode motion information of driving poses. Furthermore, we adopt a simple concatenation operation to integrate the reference appearance into the model and incorporate the pose information of the reference image for enhanced pose alignment. Experimental results show that our approach achieves visually appearing and temporally consistent high-fidelity animations. Trained on 480p (832x480) videos, UniAnimate-DiT demonstrates strong generalization capabilities to seamlessly upscale to 720P (1280x720) during inference. The training and inference code is publicly available at https://github.com/ali-vilab/UniAnimate-DiT.
Related papers
- MVAnimate: Enhancing Character Animation with Multi-View Optimization [55.4217617472079]
We introduce MVAnimate, a novel framework that synthesizes both 2D and 3D information of dynamic figures based on multi-view prior information.<n>Our approach leverages multi-view prior information to produce temporally consistent and spatially coherent animation outputs.
arXiv Detail & Related papers (2026-02-09T14:55:21Z) - Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation [46.27695095774081]
2D diffusion-based methods often compromise 3D consistency and speed.<n>3D-aware facial animation feedforward methods ensure 3D consistency and achieve faster inference speed.<n>Our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art.
arXiv Detail & Related papers (2025-12-18T18:53:28Z) - Animate-X++: Universal Character Image Animation with Dynamic Backgrounds [32.04255747303296]
Animate-X++ is a universal animation framework based on DiT for various character types, including anthropomorphic characters.<n>To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion pattern from the driving video through both implicit and explicit manner.<n>For the second challenge, we introduce a multi-task training strategy that jointly trains the animation and TI2V tasks.
arXiv Detail & Related papers (2025-08-13T03:11:28Z) - UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z) - EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture [11.587428534308945]
EasyAnimate is an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes.
We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a motion module block.
We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training, and end-to-end video inference.
arXiv Detail & Related papers (2024-05-29T11:11:07Z) - Zero-shot High-fidelity and Pose-controllable Character Animation [89.74818983864832]
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z) - AnimateZoo: Zero-shot Video Generation of Cross-Species Animation via Subject Alignment [64.02822911038848]
We present AnimateZoo, a zero-shot diffusion-based video generator to produce animal animations.
Key technique used in our AnimateZoo is subject alignment, which includes two steps.
Our model is capable of generating videos characterized by accurate movements, consistent appearance, and high-fidelity frames.
arXiv Detail & Related papers (2024-04-07T12:57:41Z) - AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
arXiv Detail & Related papers (2023-12-06T13:39:35Z) - MagicAnimate: Temporally Consistent Human Image Animation using
Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving reference image faithfully, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.