CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation
- URL: http://arxiv.org/abs/2601.11096v1
- Date: Fri, 16 Jan 2026 08:53:09 GMT
- Title: CoDance: An Unbind-Rebind Paradigm for Robust Multi-Subject Animation
- Authors: Shuai Tan, Biao Gong, Ke Ma, Yutong Feng, Qiyuan Zhang, Yan Wang, Yujun Shen, Hengshuang Zhao
- Abstract summary: CoDance is a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single pose sequence. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts.
- Score: 95.46061771820412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Character image animation is gaining significant importance across various domains, driven by the demand for robust and flexible multi-subject rendering. While existing methods excel in single-person animation, they struggle to handle arbitrary subject counts, diverse character types, and spatial misalignment between the reference image and the driving poses. We attribute these limitations to an overly rigid spatial binding that forces strict pixel-wise alignment between the pose and reference, and an inability to consistently rebind motion to intended subjects. To address these challenges, we propose CoDance, a novel Unbind-Rebind framework that enables the animation of arbitrary subject counts, types, and spatial configurations conditioned on a single, potentially misaligned pose sequence. Specifically, the Unbind module employs a novel pose shift encoder to break the rigid spatial binding between the pose and the reference by introducing stochastic perturbations to both poses and their latent features, thereby compelling the model to learn a location-agnostic motion representation. To ensure precise control and subject association, we then devise a Rebind module, leveraging semantic guidance from text prompts and spatial guidance from subject masks to direct the learned motion to intended characters. Furthermore, to facilitate comprehensive evaluation, we introduce a new multi-subject CoDanceBench. Extensive experiments on CoDanceBench and existing datasets show that CoDance achieves SOTA performance, exhibiting remarkable generalization across diverse subjects and spatial layouts. The code and weights will be open-sourced.
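The abstract's Unbind idea (stochastic perturbations that break pixel-wise pose-reference alignment) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, perturbation magnitudes, and 2D keypoint layout are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unbind_perturb(pose_keypoints, max_shift=0.1, noise_std=0.02):
    """Toy sketch of the 'unbind' idea: a random global shift plus small
    per-joint jitter, so a model trained on the perturbed poses cannot
    rely on strict pixel-wise alignment with the reference image.
    Magnitudes are illustrative, not taken from the paper."""
    # One global translation shared by all joints, in normalized coords.
    shift = rng.uniform(-max_shift, max_shift, size=(1, 2))
    # Independent Gaussian jitter per joint.
    jitter = rng.normal(0.0, noise_std, size=pose_keypoints.shape)
    return pose_keypoints + shift + jitter

# Example: 17 COCO-style keypoints in normalized [0, 1] coordinates.
pose = rng.uniform(0.2, 0.8, size=(17, 2))
perturbed = unbind_perturb(pose)
```

In this reading, the perturbation acts as data augmentation that forces a location-agnostic motion representation; the paper additionally perturbs latent pose features inside its pose shift encoder, which this sketch does not model.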
Related papers
- MultiAnimate: Pose-Guided Image Animation Made Extensible [44.163219649465866]
Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. We propose a multi-character image animation framework built upon modern Diffusion Transformers for video generation. We show that our framework achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
arXiv Detail & Related papers (2026-02-25T05:06:58Z) - One-to-All Animation: Alignment-Free Character Animation and Image Pose Transfer [36.26551019954542]
We present One-to-All Animation, a framework for high-fidelity character animation and image pose transfer. To handle spatially misaligned references, we reformulate training as a self-supervised outpainting task. We also design a reference extractor for comprehensive identity feature extraction.
arXiv Detail & Related papers (2025-11-28T07:30:10Z) - DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation [63.781450025764904]
We propose DynamiCtrl, a novel framework for human animation in a video DiT architecture. We use a shared VAE encoder for human images and driving poses, unifying them into a common latent space. We also introduce the "Joint-text" paradigm, which preserves the role of text embeddings to provide global semantic context.
arXiv Detail & Related papers (2025-03-27T08:07:45Z) - Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects. Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities. We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z) - VINECS: Video-based Neural Character Skinning [82.39776643541383]
We propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights.
We show that our approach outperforms state-of-the-art while not relying on dense 4D scans.
arXiv Detail & Related papers (2023-07-03T08:35:53Z) - Shuffled Autoregression For Motion Interpolation [53.61556200049156]
This work aims to provide a deep-learning solution for the motion task.
We propose a novel framework, referred to as Shuffled AutoRegression, which expands autoregression to generate in an arbitrary (shuffled) order.
We also propose an approach to constructing a particular kind of dependency graph, with three stages assembled into an end-to-end spatial-temporal motion Transformer.
arXiv Detail & Related papers (2023-06-10T07:14:59Z) - PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling [30.93155530590843]
We present PoseVocab, a novel pose encoding method that can encode high-fidelity human details.
Given multi-view RGB videos of a character, PoseVocab constructs key poses and latent embeddings based on the training poses.
Experiments show that our method outperforms other state-of-the-art baselines.
arXiv Detail & Related papers (2023-04-25T17:25:36Z) - Neural Rendering of Humans in Novel View and Pose from Monocular Video [68.37767099240236]
We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input.
Our method significantly outperforms existing approaches under unseen poses and novel views given monocular videos as input.
arXiv Detail & Related papers (2022-04-04T03:09:20Z) - Hierarchical Neural Implicit Pose Network for Animation and Motion Retargeting [66.69067601079706]
HIPNet is a neural implicit pose network trained on multiple subjects across many poses.
We employ a hierarchical skeleton-based representation to learn a signed distance function on a canonical unposed space.
We achieve state-of-the-art results on various single-subject and multi-subject benchmarks.
arXiv Detail & Related papers (2021-12-02T03:25:46Z) - A Hierarchy-Aware Pose Representation for Deep Character Animation [2.47343886645587]
We present a robust pose representation for motion modeling, suitable for deep character animation.
Our representation is based on dual quaternions, mathematical abstractions with well-defined operations that simultaneously encode rotation and translation.
We show that our representation overcomes common motion artifacts, and assess its performance compared to other popular representations.
arXiv Detail & Related papers (2021-11-27T14:33:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.