MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
- URL: http://arxiv.org/abs/2602.13326v1
- Date: Wed, 11 Feb 2026 03:03:44 GMT
- Title: MotionWeaver: Holistic 4D-Anchored Framework for Multi-Humanoid Image Animation
- Authors: Xirui Hu, Yanbo Ding, Jiahao Wang, Tingting Shi, Yali Wang, Guo Zhi Zhi, Weizhan Zhang
- Abstract summary: MotionWeaver is an end-to-end framework for multi-humanoid image animation. We introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters. We also propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents.
- Score: 22.502601281241724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Character image animation, which synthesizes videos of reference characters driven by pose sequences, has advanced rapidly but remains largely limited to single-human settings. Existing methods struggle to generalize to multi-humanoid scenarios, which involve diverse humanoid forms, complex interactions, and frequent occlusions. We address this gap with two key innovations. First, we introduce unified motion representations that extract identity-agnostic motions and explicitly bind them to corresponding characters, enabling generalization across diverse humanoid forms and seamless extension to multi-humanoid scenarios. Second, we propose a holistic 4D-anchored paradigm that constructs a shared 4D space to fuse motion representations with video latents, and further reinforces this process with hierarchical 4D-level supervision to better handle interactions and occlusions. We instantiate these ideas in MotionWeaver, an end-to-end framework for multi-humanoid image animation. To support this setting, we curate a 46-hour dataset of multi-human videos with rich interactions, and construct a 300-video benchmark featuring paired humanoid characters. Quantitative and qualitative experiments demonstrate that MotionWeaver not only achieves state-of-the-art results on our benchmark but also generalizes effectively across diverse humanoid forms, complex interactions, and challenging multi-humanoid scenarios.
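The abstract's first innovation, extracting identity-agnostic motion and explicitly binding it to each character, can be illustrated with a minimal sketch. This is a hypothetical toy illustration, not the authors' implementation: it assumes motion is made identity-agnostic by expressing joints relative to a root joint at unit bone scale, and that "binding" amounts to tagging each normalized motion stream with its character's identifier.

```python
# Hypothetical sketch of identity-agnostic motion extraction and binding.
# Not the MotionWeaver code; the normalization scheme and data layout here
# are illustrative assumptions only.

def normalize_motion(joint_positions, root, bone_scale):
    """Strip identity cues: express 2D joints relative to the root joint,
    rescaled so characters of different sizes yield the same motion."""
    return [((x - root[0]) / bone_scale, (y - root[1]) / bone_scale)
            for x, y in joint_positions]

def bind_motions(motions, character_ids):
    """Explicitly pair each identity-agnostic motion with its character,
    so each motion stream drives only its corresponding reference."""
    return [{"char_id": cid, "motion": m}
            for cid, m in zip(character_ids, motions)]

# Toy example: a small and a large character performing the same gesture.
raw_a = [(10.0, 10.0), (12.0, 14.0)]   # small character
raw_b = [(50.0, 50.0), (54.0, 58.0)]   # same pose, scaled 2x
norm_a = normalize_motion(raw_a, root=raw_a[0], bone_scale=2.0)
norm_b = normalize_motion(raw_b, root=raw_b[0], bone_scale=4.0)
bound = bind_motions([norm_a, norm_b], ["char_0", "char_1"])
# After normalization both characters carry identical motion tokens,
# and the binding step keeps them attached to the right identity.
```

In this toy setting the two raw pose sequences differ, but their normalized forms coincide, which is the point of an identity-agnostic representation: the same motion can then be re-bound to any humanoid character.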
Related papers
- MultiAnimate: Pose-Guided Image Animation Made Extensible [44.163219649465866]
Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. We propose a multi-character image animation framework built upon modern Diffusion Transformers for video generation. We show that our framework achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
arXiv Detail & Related papers (2026-02-25T05:06:58Z) - Human Video Generation from a Single Image with 3D Pose and View Control [62.676151243249556]
We present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, temporally coherent human videos from a single image. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii)
arXiv Detail & Related papers (2026-02-24T18:42:20Z) - MoSA: Motion-Coherent Human Video Generation via Structure-Appearance Decoupling [107.8379802891245]
We propose MoSA, which decouples the process of human video generation into two components, i.e., structure generation and appearance generation. MoSA substantially outperforms existing approaches across the majority of evaluation metrics. This paper also contributes a large-scale human video dataset, which features more complex and diverse motions than existing human video datasets.
arXiv Detail & Related papers (2025-08-24T15:20:24Z) - InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions [70.63690961790573]
End-to-end human animation with rich multi-modal conditions has achieved remarkable advancements in recent years. Most existing methods can only animate a single subject and inject conditions in a global manner. We introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity's spatiotemporal footprint.
arXiv Detail & Related papers (2025-06-11T17:57:09Z) - Multi-identity Human Image Animation with Structural Video Diffusion [73.38728096088732]
Structural Video Diffusion is a novel framework for generating realistic multi-human videos. Our approach introduces identity-specific embeddings to maintain consistent appearances across individuals. We expand the existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios.
arXiv Detail & Related papers (2025-04-05T10:03:49Z) - Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models [71.78723353724493]
Animation of humanoid characters is essential in various graphics applications. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes.
arXiv Detail & Related papers (2025-03-20T10:00:22Z) - Harmony4D: A Video Dataset for In-The-Wild Close Human Interactions [27.677520981665012]
Harmony4D is a dataset for human-human interaction featuring in-the-wild activities such as wrestling, dancing, MMA, and more.
We use a flexible multi-view capture system to record these dynamic activities and provide annotations for human detection, tracking, 2D/3D pose estimation, and mesh recovery for closely interacting subjects.
arXiv Detail & Related papers (2024-10-27T00:05:15Z) - AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer [38.85054820740242]
We present a novel approach for generating high-quality, coherent human videos from a single image.
Our framework combines the strengths of diffusion transformers for capturing global correlations and CNNs for accurate condition injection.
We demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos.
arXiv Detail & Related papers (2024-05-27T17:53:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.