SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
- URL: http://arxiv.org/abs/2511.19320v1
- Date: Mon, 24 Nov 2025 17:15:55 GMT
- Title: SteadyDancer: Harmonized and Coherent Human Image Animation with First-Frame Preservation
- Authors: Jiaming Zhang, Shengming Cao, Rui Li, Xiaotong Zhao, Yutao Cui, Xinglin Hou, Gangshan Wu, Haolan Chen, Yu Xu, Limin Wang, Kai Ma
- Abstract summary: We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control.
- Score: 50.792027578906804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preserving first-frame identity while ensuring precise motion control is a fundamental challenge in human image animation. The Image-to-Motion Binding process of the dominant Reference-to-Video (R2V) paradigm overlooks critical spatio-temporal misalignments common in real-world applications, leading to failures such as identity drift and visual artifacts. We introduce SteadyDancer, an Image-to-Video (I2V) paradigm-based framework that achieves harmonized and coherent animation and is the first to ensure first-frame preservation robustly. Firstly, we propose a Condition-Reconciliation Mechanism to harmonize the two conflicting conditions, enabling precise control without sacrificing fidelity. Secondly, we design Synergistic Pose Modulation Modules to generate an adaptive and coherent pose representation that is highly compatible with the reference image. Finally, we employ a Staged Decoupled-Objective Training Pipeline that hierarchically optimizes the model for motion fidelity, visual quality, and temporal coherence. Experiments demonstrate that SteadyDancer achieves state-of-the-art performance in both appearance fidelity and motion control, while requiring significantly fewer training resources than comparable methods.
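As a rough illustration of the first contribution, the sketch below shows one plausible way to reconcile two competing conditions (first-frame identity vs. per-frame pose) before they are injected into a video diffusion backbone. The module name, gating design, and dimensions are illustrative assumptions; the paper does not publish this interface.

```python
# A hypothetical sketch (not SteadyDancer's published code): reconcile the
# identity condition (first frame) with the pose condition via a learned,
# per-token gate, so pose control overrides identity only where needed.
import torch
import torch.nn as nn

class ConditionReconciler(nn.Module):
    """Blends identity and pose features with a content-dependent gate."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.id_proj = nn.Linear(dim, dim)
        self.pose_proj = nn.Linear(dim, dim)
        # Gate in [0, 1]: how strongly pose may override identity per token.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, id_feat: torch.Tensor, pose_feat: torch.Tensor) -> torch.Tensor:
        # id_feat, pose_feat: (batch, tokens, dim)
        id_h, pose_h = self.id_proj(id_feat), self.pose_proj(pose_feat)
        g = self.gate(torch.cat([id_h, pose_h], dim=-1))
        return g * pose_h + (1.0 - g) * id_h  # reconciled conditioning tokens

id_tokens = torch.randn(2, 77, 256)    # reference-image (first-frame) tokens
pose_tokens = torch.randn(2, 77, 256)  # pose tokens for one target frame
cond = ConditionReconciler()(id_tokens, pose_tokens)
print(cond.shape)  # torch.Size([2, 77, 256])
```

Read this way, "precise control without sacrificing fidelity" means the gate lets pose information act locally while identity features dominate elsewhere.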
Related papers
- Kling-MotionControl Technical Report [46.75274343533976]
Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. We present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation.
arXiv Detail & Related papers (2026-03-03T17:02:45Z)
- IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z)
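A minimal sketch of the 1D motion-token idea from the IM-Animation entry above: learned queries cross-attend to dense per-frame features and pool them into a few appearance-agnostic tokens. The query count, dimensions, and attention-pooling design are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: compress dense per-frame motion features into a small
# set of 1D motion tokens via learned queries and attention pooling.
import torch
import torch.nn as nn

class MotionTokenizer(nn.Module):
    def __init__(self, dim: int = 256, n_tokens: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_tokens, dim))  # learned token slots
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, frame_feat: torch.Tensor) -> torch.Tensor:
        # frame_feat: (batch, h*w, dim) dense features of one driving frame
        q = self.queries.unsqueeze(0).expand(frame_feat.size(0), -1, -1)
        tokens, _ = self.attn(q, frame_feat, frame_feat)  # pool motion into tokens
        return tokens  # (batch, n_tokens, dim): compact motion code

feat = torch.randn(2, 1024, 256)
print(MotionTokenizer()(feat).shape)  # torch.Size([2, 8, 256])
```

The small bottleneck is what would discourage appearance details from leaking through, which matches the entry's identity-decoupling motivation.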
- AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation [45.753757870577196]
We introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. We show that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses.
arXiv Detail & Related papers (2026-02-04T15:42:58Z)
- DreamActor-M2: Universal Character Image Animation via Spatiotemporal In-Context Learning [24.808926786222376]
We present DreamActor-M2, a universal animation framework that reimagines motion conditioning as an in-context learning problem. Our approach follows a two-stage paradigm. First, we bridge the input modality gap by fusing reference appearance and motion cues into a unified latent space. Second, we introduce a self-bootstrapped data synthesis pipeline that curates pseudo cross-identity training pairs.
arXiv Detail & Related papers (2026-01-29T13:43:17Z)
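A hypothetical sketch of "motion conditioning as in-context learning" from the DreamActor-M2 entry above: reference appearance tokens and motion cues are projected into one latent space and concatenated as context that the denoiser attends to jointly with the video tokens. All module names and shapes here are assumptions.

```python
# Hypothetical sketch: fuse appearance and motion into one context sequence,
# then let a transformer block attend over context + noisy video tokens.
import torch
import torch.nn as nn

dim = 256
ref_proj = nn.Linear(512, dim)    # map appearance features into a shared space
mot_proj = nn.Linear(128, dim)    # map motion cues into the same space
block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

ref = torch.randn(2, 77, 512)     # reference-image tokens
mot = torch.randn(2, 16, 128)     # per-frame motion cues
vid = torch.randn(2, 256, dim)    # noisy video latent tokens

context = torch.cat([ref_proj(ref), mot_proj(mot)], dim=1)  # unified latent context
out = block(torch.cat([context, vid], dim=1))               # joint self-attention
vid_out = out[:, context.size(1):]                          # keep only video tokens
print(vid_out.shape)  # torch.Size([2, 256, 256])
```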
- PhysCorr: Dual-Reward DPO for Physics-Constrained Text-to-Video Generation with Automated Preference Selection [10.498184571108995]
We propose PhysCorr, a unified framework for modeling, evaluating, and optimizing physical consistency in video generation. Specifically, we introduce PhysicsRM, the first dual-dimensional reward model that quantifies both intra-object stability and inter-object interactions. Our approach is model-agnostic and scalable, enabling seamless integration into a wide range of video diffusion and transformer-based backbones.
arXiv Detail & Related papers (2025-11-06T02:40:57Z)
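A toy sketch of the automated preference selection described in the PhysCorr entry above: two reward dimensions are combined into one scalar, and the best and worst candidates per prompt form a DPO preference pair. The weighting and the scores are placeholders, not PhysicsRM outputs.

```python
# Toy sketch: combine two reward dimensions and auto-select a DPO pair.
import torch

def combined_reward(stability: torch.Tensor, interaction: torch.Tensor,
                    w: float = 0.5) -> torch.Tensor:
    # Weighted sum of the two reward dimensions; w is a placeholder choice.
    return w * stability + (1.0 - w) * interaction

def select_preference_pair(rewards: torch.Tensor):
    """Given rewards for N candidate videos of one prompt, return (chosen, rejected)."""
    return int(rewards.argmax()), int(rewards.argmin())

stab = torch.tensor([0.9, 0.4, 0.7])   # per-candidate intra-object stability scores
inter = torch.tensor([0.5, 0.8, 0.6])  # per-candidate inter-object interaction scores
chosen, rejected = select_preference_pair(combined_reward(stab, inter))
print(chosen, rejected)  # 0 1 -> feed this pair to a standard DPO loss
```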
- AvatarVTON: 4D Virtual Try-On for Animatable Avatars [67.13031660684457]
AvatarVTON generates realistic try-on results from a single in-shop garment image. It supports dynamic garment interactions under single-view supervision and is well-suited for AR/VR, gaming, and digital-human applications.
arXiv Detail & Related papers (2025-10-06T14:06:34Z)
- StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing [63.72095377128904]
The visual dubbing task aims to generate mouth movements synchronized with the driving audio. Audio-only driving paradigms inadequately capture speaker-specific lip habits, and blind-inpainting approaches produce visual artifacts when handling obstructions.
arXiv Detail & Related papers (2025-09-26T05:23:31Z)
- AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars [71.90109867684025]
Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans. We propose AsynFusion, a novel framework that leverages diffusion transformers to achieve cohesive expression and gesture synthesis. AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations.
arXiv Detail & Related papers (2025-05-21T03:28:53Z)
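A speculative sketch of the decoupled-but-cohesive generation described in the AsynFusion entry above: separate expression and gesture branches each process their own latents, then exchange information via cross-attention so the two modalities stay synchronized. The cooperation mechanism shown is an assumption, not the paper's architecture.

```python
# Speculative sketch: two decoupled branches with cross-branch attention.
import torch
import torch.nn as nn

dim = 128
expr_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
gest_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

expr = torch.randn(2, 50, dim)   # expression latents
gest = torch.randn(2, 50, dim)   # gesture latents
audio = torch.randn(2, 50, dim)  # shared audio features driving both branches

# Each branch updates its own modality conditioned on audio...
expr = expr_layer(expr + audio)
gest = gest_layer(gest + audio)
# ...then cross-attends to the other branch so whole-body motion stays cohesive.
expr_sync, _ = cross(expr, gest, gest)
gest_sync, _ = cross(gest, expr, expr)
print(expr_sync.shape, gest_sync.shape)
```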
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics. In the first stage, we employ a clip-level training scheme to establish coherent global motion. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z)
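A minimal sketch of the second-stage idea from the FantasyTalking entry above: restricting the reconstruction loss to a mouth-region mask so frame-level refinement targets lip motion without disturbing the rest of the frame. The mask source and exact loss form are assumptions.

```python
# Minimal sketch: a loss confined to the lip region via a binary mask.
import torch
import torch.nn.functional as F

def lip_masked_loss(pred: torch.Tensor, target: torch.Tensor,
                    lip_mask: torch.Tensor) -> torch.Tensor:
    # pred/target: (batch, frames, c, h, w); lip_mask: (batch, frames, 1, h, w) in {0,1}
    diff = F.mse_loss(pred, target, reduction="none")
    return (diff * lip_mask).sum() / lip_mask.sum().clamp(min=1.0)

pred = torch.randn(1, 8, 3, 64, 64)
target = torch.randn(1, 8, 3, 64, 64)
mask = torch.zeros(1, 8, 1, 64, 64)
mask[..., 40:56, 16:48] = 1.0  # mouth region, e.g. from a face parser (hypothetical)
print(lip_masked_loss(pred, target, mask))
```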
- DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in the video DiT architecture. We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space. We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z)
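A minimal sketch of the shared-encoder idea from the DynamiCtrl entry above: one encoder (standing in for the pretrained video VAE) maps both the human image and a rendered pose map into the same latent space before they are fused. The tiny conv encoder here is a placeholder, not the real VAE.

```python
# Minimal sketch: one shared encoder for both image and pose modalities.
import torch
import torch.nn as nn

encoder = nn.Sequential(            # shared weights for both modalities
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(32, 4, 4, stride=2, padding=1),
)

human = torch.randn(1, 3, 256, 256)       # reference human image
pose = torch.randn(1, 3, 256, 256)        # rendered pose map in the same 3-channel format

z_human = encoder(human)                  # (1, 4, 64, 64)
z_pose = encoder(pose)                    # same latent space, same statistics
z = torch.cat([z_human, z_pose], dim=1)   # unified conditioning latent
print(z.shape)  # torch.Size([1, 8, 64, 64])
```

Sharing the encoder means image and pose latents follow the same distribution, which is what makes fusing them in one latent space straightforward.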
- Zero-shot High-fidelity and Pose-controllable Character Animation [89.74818983864832]
Image-to-video (I2V) generation aims to create a video sequence from a single image.
Existing approaches suffer from inconsistency of character appearances and poor preservation of fine details.
We propose PoseAnimate, a novel zero-shot I2V framework for character animation.
arXiv Detail & Related papers (2024-04-21T14:43:31Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z)