DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- URL: http://arxiv.org/abs/2504.01724v3
- Date: Sun, 20 Apr 2025 11:52:01 GMT
- Title: DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
- Authors: Yuxuan Luo, Zhengkun Rong, Lizhen Wang, Longhao Zhang, Tianshu Hu, Yongming Zhu
- Abstract summary: We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements. Experiments demonstrate that our method outperforms state-of-the-art methods, delivering expressive results for portrait, upper-body, and full-body generation.
- Score: 9.898947423344884
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While recent image-based human animation methods achieve realistic body and facial motion synthesis, critical gaps remain in fine-grained holistic controllability, multi-scale adaptability, and long-term temporal coherence, which limits their expressiveness and robustness. We propose a diffusion transformer (DiT) based framework, DreamActor-M1, with hybrid guidance to overcome these limitations. For motion guidance, our hybrid control signals that integrate implicit facial representations, 3D head spheres, and 3D body skeletons achieve robust control of facial expressions and body movements, while producing expressive and identity-preserving animations. For scale adaptation, to handle various body poses and image scales ranging from portraits to full-body views, we employ a progressive training strategy using data with varying resolutions and scales. For appearance guidance, we integrate motion patterns from sequential frames with complementary visual references, ensuring long-term temporal coherence for unseen regions during complex movements. Experiments demonstrate that our method outperforms state-of-the-art methods, delivering expressive results for portrait, upper-body, and full-body generation with robust long-term consistency. Project Page: https://grisoon.github.io/DreamActor-M1/.
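To make the hybrid-guidance idea concrete, the following is a minimal PyTorch sketch, not the authors' code, of how implicit facial features, a rendered 3D head sphere, and a projected 3D body skeleton might be fused into a single conditioning sequence for a DiT. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions.

```python
# Hypothetical sketch (not DreamActor-M1's implementation): fusing
# hybrid control signals -- implicit face features, a rendered 3D head
# sphere map, and a projected 3D body skeleton map -- into one token
# sequence that a diffusion transformer could attend to.
import torch
import torch.nn as nn

class HybridGuidanceEncoder(nn.Module):
    def __init__(self, face_dim=512, img_channels=3, token_dim=1024):
        super().__init__()
        # Implicit facial representation -> per-frame motion tokens
        self.face_proj = nn.Linear(face_dim, token_dim)
        # Head-sphere and skeleton renderings -> patch tokens
        # (a single strided conv "patchify" stands in for a real pose encoder)
        self.pose_patchify = nn.Conv2d(2 * img_channels, token_dim,
                                       kernel_size=16, stride=16)

    def forward(self, face_feats, head_sphere, skeleton_map):
        # face_feats:   (B, T, face_dim)  implicit expression features
        # head_sphere:  (B, C, H, W)      rendered 3D head sphere
        # skeleton_map: (B, C, H, W)      projected 3D body skeleton
        face_tokens = self.face_proj(face_feats)              # (B, T, D)
        pose_maps = torch.cat([head_sphere, skeleton_map], dim=1)
        pose_tokens = self.pose_patchify(pose_maps)           # (B, D, H/16, W/16)
        pose_tokens = pose_tokens.flatten(2).transpose(1, 2)  # (B, N, D)
        # Concatenate along the sequence axis; a DiT would consume these
        # via cross-attention or in-context conditioning.
        return torch.cat([face_tokens, pose_tokens], dim=1)

if __name__ == "__main__":
    enc = HybridGuidanceEncoder()
    cond = enc(torch.randn(1, 8, 512),
               torch.randn(1, 3, 256, 256),
               torch.randn(1, 3, 256, 256))
    print(cond.shape)  # torch.Size([1, 264, 1024]): 8 face + 256 pose tokens
```

Keeping the face and pose streams as separate token groups, rather than summing them, is one plausible way to let the transformer weight expression and body cues independently; the paper does not specify this detail.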
Related papers
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics.
In the first stage, we employ a clip-level training scheme to establish coherent global motion.
In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
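As a rough illustration of the frame-level lip refinement described above, the sketch below restricts a reconstruction loss to a lip-region mask. The mask source, loss form, and weighting are assumptions for illustration, not details from the FantasyTalking paper.

```python
# Hypothetical sketch (not the FantasyTalking code): a frame-level,
# lip-masked reconstruction loss that upweights the lip region so
# lip motion stays tightly synchronized with the audio signal.
import torch
import torch.nn.functional as F

def lip_masked_loss(pred, target, lip_mask, lip_weight=10.0):
    """pred/target: (B, C, H, W) frames; lip_mask: (B, 1, H, W) in [0, 1]."""
    base = F.l1_loss(pred, target)                       # global term
    lip = F.l1_loss(pred * lip_mask, target * lip_mask)  # lip-only term
    return base + lip_weight * lip
```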
arXiv Detail & Related papers (2025-04-07T08:56:01Z) - X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image. It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z) - GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression [33.886734972316326]
GoHD is a framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion. An animation module utilizing latent navigation is introduced to improve generalization across unseen input styles. A conformer-structured conditional diffusion model is designed to guarantee prosody-aware head poses. A two-stage training strategy is devised to decouple frequent, frame-wise lip motion distillation from the generation of other motions that are more temporally dependent but less audio-related.
arXiv Detail & Related papers (2024-12-12T14:12:07Z) - Physically Plausible Animation of Human Upper Body from a Single Image [41.027391105867345]
We present a new method for generating controllable, dynamically responsive, and photorealistic human animations.
Given an image of a person, our system allows the user to generate Physically plausible Upper Body Animation (PUBA) using interaction in the image space.
arXiv Detail & Related papers (2022-12-09T09:36:59Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z) - Video-driven Neural Physically-based Facial Asset for Production [33.24654834163312]
We present a new learning-based, video-driven approach for generating dynamic facial geometries with high-quality physically-based assets.
Our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
arXiv Detail & Related papers (2022-02-11T13:22:48Z) - Imposing Temporal Consistency on Deep Monocular Body Shape and Pose Estimation [67.23327074124855]
This paper presents an elegant solution for the integration of temporal constraints in the fitting process.
We derive parameters of a sequence of body models, representing shape and motion of a person, including jaw poses, facial expressions, and finger poses.
Our approach enables the derivation of realistic 3D body models from image sequences, including facial expression and articulated hands.
arXiv Detail & Related papers (2022-02-07T11:11:55Z) - Style and Pose Control for Image Synthesis of Humans from a Single Monocular View [78.6284090004218]
StylePoseGAN extends a non-controllable generator to accept conditioning of pose and appearance separately.
Our network can be trained in a fully supervised way with human images to disentangle pose, appearance and body parts.
StylePoseGAN achieves state-of-the-art image generation fidelity on common perceptual metrics.
arXiv Detail & Related papers (2021-02-22T18:50:47Z) - Monocular Real-time Full Body Capture with Inter-part Correlations [66.22835689189237]
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency.
arXiv Detail & Related papers (2020-12-11T02:37:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.