StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation
- URL: http://arxiv.org/abs/2507.15064v1
- Date: Sun, 20 Jul 2025 17:59:26 GMT
- Title: StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation
- Authors: Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: Current diffusion models for human image animation often struggle to maintain identity consistency. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment. StableAnimator++ generates high-quality videos conditioned on a reference image and a pose sequence without any post-processing.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses by injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, largely mitigating misalignment. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further preserve identity, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB)-based face optimization integrated into the denoising process, guiding the diffusion trajectory toward enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.
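Aligning two 2D keypoint sets by a similarity transform has a classical closed-form solution (Umeyama-style Procrustes analysis) built on the SVD of their cross-covariance, which is presumably the geometry the SVD guidance draws on. The sketch below is a minimal NumPy version of that closed-form alignment, offered only as a reference point: the paper's module is learnable, and the function name and dummy keypoints here are illustrative, not the authors' implementation.

```python
# Minimal sketch: closed-form similarity alignment (Umeyama/Procrustes)
# between driving-pose keypoints and reference-image keypoints.
# Assumption: the paper's learnable layers *predict* such a transform;
# this classical SVD-based version is shown only for intuition.
import numpy as np

def similarity_transform(src: np.ndarray, dst: np.ndarray):
    """Estimate scale s, rotation R, translation t with dst ~ s * (R @ src) + t.

    src, dst: (N, 2) arrays of corresponding 2D keypoints.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # SVD of the cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection correction keeps R a proper rotation (det = +1).
    S = np.eye(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[-1, -1] = -1.0

    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Usage: rescale/reposition a driving frame's pose to match the reference.
ref_kpts = np.random.rand(18, 2)        # reference-image keypoints (dummy)
drv_kpts = 0.7 * ref_kpts + 0.1         # driving-frame keypoints (dummy)
s, R, t = similarity_transform(drv_kpts, ref_kpts)
aligned = s * drv_kpts @ R.T + t        # now roughly matches the reference body
```

In the paper's setting, such a transform would be estimated per frame, so each driven pose is rescaled and repositioned to match the reference body before it conditions the diffusion model.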
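The HJB-based face optimization is described only at a high level in the abstract. As a loose intuition for how a denoising trajectory can be steered toward facial fidelity, the hedged sketch below shows a generic gradient-guidance step toward a reference face embedding; it is not the paper's HJB derivation, and `denoiser`, `face_encoder`, and the weight `lam` are hypothetical placeholders.

```python
# Hedged sketch: gradient-guidance-style identity steering during
# denoising. StableAnimator++ derives its update from an HJB
# formulation; this snippet only illustrates the generic idea of
# nudging each step's predicted clean sample toward a reference
# face embedding. All callables here are assumed placeholders.
import torch

def face_guided_x0(x_t, t, ref_emb, denoiser, face_encoder, lam=0.1):
    """One guided step: predict the clean sample, then ascend ID similarity."""
    with torch.enable_grad():
        x0 = denoiser(x_t, t).detach().requires_grad_(True)
        emb = face_encoder(x0)                      # face embedding of the prediction
        sim = torch.cosine_similarity(emb, ref_emb, dim=-1).mean()
        grad = torch.autograd.grad(sim, x0)[0]      # direction that raises ID similarity
    # The corrected prediction would then feed the scheduler's usual
    # x_t -> x_{t-1} update (omitted here).
    return (x0 + lam * grad).detach()
```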
Related papers
- DisPose: Disentangling Pose Guidance for Controllable Human Image Animation
DisPose aims to disentangle the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence. To seamlessly integrate into existing models, we propose a plug-and-play hybrid ControlNet.
arXiv Detail & Related papers (2024-12-12T15:15:59Z)
- StableAnimator: High-Quality Identity-Preserving Human Image Animation
This paper presents StableAnimator, the first end-to-end ID-preserving video diffusion framework. It synthesizes high-quality videos without any post-processing, conditioned on a reference image and a sequence of poses. During inference, we propose a novel Hamilton-Jacobi-Bellman (HJB) equation-based optimization to further enhance the face quality.
arXiv Detail & Related papers (2024-11-26T18:59:22Z)
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
We present the UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image, along with the posture guidance and noise video, into a common feature space.
We also propose a unified noise input that supports both randomly noised input and first-frame-conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
- VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z)
- ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
ID-Animator is a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training.
Our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and various community backbone models.
arXiv Detail & Related papers (2024-04-23T17:59:43Z)
- Zero-shot High-fidelity and Pose-controllable Character Animation
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, faithfully preserving the reference image, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z)