RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
- URL: http://arxiv.org/abs/2504.14977v1
- Date: Mon, 21 Apr 2025 09:09:21 GMT
- Title: RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
- Authors: Jingkai Zhou, Yifan Wu, Shikai Li, Min Wei, Chao Fan, Weihua Chen, Wei Jiang, Fan Wang
- Abstract summary: We introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. We demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. In addition, we introduce a new test dataset that captures diverse real-world challenges.
- Score: 27.205550992261493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable character animation remains a challenging problem, particularly in handling rare poses, stylized characters, character-object interactions, complex illumination, and dynamic scenes. To tackle these issues, prior work has largely focused on injecting pose and appearance guidance via elaborate bypass networks, but often struggles to generalize to open-world scenarios. In this paper, we propose a new perspective: as long as the foundation model is powerful enough, straightforward model modifications with flexible fine-tuning strategies can largely address the above challenges, taking a step towards controllable character animation in the wild. Specifically, we introduce RealisDance-DiT, built upon the Wan-2.1 video foundation model. Our analysis reveals that the widely adopted Reference Net design is suboptimal for large-scale DiT models. Instead, we demonstrate that minimal modifications to the foundation model architecture yield a surprisingly strong baseline. We further propose the low-noise warmup and "large batches and small iterations" strategies to accelerate model convergence during fine-tuning while maximally preserving the priors of the foundation model. In addition, we introduce a new test dataset that captures diverse real-world challenges, complementing existing benchmarks such as the TikTok dataset and the UBC fashion video dataset, to comprehensively evaluate the proposed method. Extensive experiments show that RealisDance-DiT outperforms existing methods by a large margin.
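The abstract names two fine-tuning strategies but gives no details. Below is a minimal sketch of one plausible reading of the low-noise warmup, assuming a DDPM-style schedule where small timestep indices mean low noise; the function name, window size, and annealing schedule are illustrative assumptions, not taken from the paper.

```python
import torch

def sample_warmup_timesteps(batch_size: int, step: int,
                            warmup_steps: int = 1000,
                            num_train_timesteps: int = 1000) -> torch.Tensor:
    """Bias sampled diffusion timesteps toward the low-noise end early on.

    Hypothetical reading of "low-noise warmup": restrict fine-tuning to
    lightly noised latents first, then anneal back to uniform sampling.
    """
    progress = min(step / warmup_steps, 1.0)
    # The sampling window grows from 10% of the schedule to the full range.
    t_max = max(int(num_train_timesteps * (0.1 + 0.9 * progress)), 1)
    return torch.randint(0, t_max, (batch_size,))
```

Under this reading, early updates only see lightly corrupted latents, which would help preserve the foundation model's denoising prior; "large batches and small iterations" would then pair naturally with gradient accumulation to enlarge the effective batch while capping the number of update steps.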
Related papers
- Task-Specific Adaptation with Restricted Model Access [23.114703555189937]
"Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allow only gradient propagation.<n>We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output.<n>We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment.
arXiv Detail & Related papers (2025-02-02T13:29:44Z)
- SMPLest-X: Ultimate Scaling for Expressive Human Pose and Shape Estimation [81.36747103102459]
Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Current state-of-the-art methods focus on training innovative architectural designs on confined datasets. We investigate the impact of scaling up EHPS towards a family of generalist foundation models.
arXiv Detail & Related papers (2025-01-16T18:59:46Z)
- Boosting Salient Object Detection with Knowledge Distillated from Large Foundation Models [7.898092154590899]
Salient Object Detection aims to identify and segment prominent regions within a scene. Traditional models rely on manually annotated pseudo labels with precise pixel-level accuracy. We develop a low-cost, high-precision annotation method to address the challenges.
arXiv Detail & Related papers (2025-01-08T15:56:21Z)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
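The summary states what SMILE achieves but not how; one zero-shot recipe consistent with the name, assumed here purely for illustration, factors each fine-tuned model's weight delta into a low-rank expert via truncated SVD.

```python
import torch

def low_rank_expert(w_base: torch.Tensor, w_finetuned: torch.Tensor, rank: int = 16):
    """Compress a task-specific weight delta into a rank-r expert (sketch).

    Assumed recipe, not necessarily SMILE's exact construction: the
    difference between fine-tuned and base weights is factored by
    truncated SVD so each source model yields a cheap additive expert.
    """
    delta = w_finetuned - w_base
    u, s, vh = torch.linalg.svd(delta, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (out_features, rank)
    b = vh[:rank, :]             # (rank, in_features)
    # Expert forward pass: y = x @ b.T @ a.T, added to the base layer's
    # output whenever a router selects this expert.
    return a, b
```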
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image [94.56927147492738]
We introduce GeoWizard, a new generative foundation model designed for estimating geometric attributes from single images.
We show that leveraging diffusion priors can markedly improve generalization, detail preservation, and efficiency in resource usage.
We propose a simple yet effective strategy to segregate the complex data distribution of various scenes into distinct sub-distributions.
arXiv Detail & Related papers (2024-03-18T17:50:41Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Studying How to Efficiently and Effectively Guide Models with Explanations [52.498055901649025]
'Model guidance' is the idea of regularizing the models' explanations to ensure that they are "right for the right reasons".
We conduct an in-depth evaluation across various loss functions, attribution methods, models, and 'guidance depths' on the PASCAL VOC 2007 and MS COCO 2014 datasets.
Specifically, we guide the models via bounding box annotations, which are much cheaper to obtain than the commonly used segmentation masks.
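As an illustration of what box-level guidance can look like (the paper compares several loss functions and attribution methods; this "fraction of attribution mass outside the box" penalty is just one plausible instance):

```python
import torch

def outside_box_loss(attributions: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
    """Penalize attribution mass that falls outside annotated boxes (sketch).

    attributions: (B, H, W) non-negative relevance maps for the target class.
    box_mask:     (B, H, W) with 1 inside ground-truth boxes, 0 elsewhere.
    """
    outside = (attributions * (1.0 - box_mask)).sum(dim=(1, 2))
    total = attributions.sum(dim=(1, 2)) + 1e-8   # avoid division by zero
    return (outside / total).mean()               # 0 when all mass is in-box
```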
arXiv Detail & Related papers (2023-03-21T15:34:50Z)
- Easy Begun is Half Done: Spatial-Temporal Graph Modeling with ST-Curriculum Dropout [8.924689054841524]
We propose ST-Curriculum Dropout, a novel and easy-to-implement strategy for spatial-temporal graph modeling.
We evaluate the learning difficulty of each node in high-level feature space and drop those difficult ones out to ensure the model only needs to handle fundamental ST relations.
Our strategy can be applied to any canonical deep learning architecture without extra trainable parameters.
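A minimal sketch of that idea follows; the distance-to-centroid difficulty score and the caller-supplied keep ratio are assumptions, but note that, matching the claim above, it adds no trainable parameters.

```python
import torch

def curriculum_node_mask(node_features: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep only the easiest nodes for this training step (sketch).

    node_features: (N, D) high-level features, one row per graph node.
    keep_ratio:    fraction of nodes to keep; a curriculum would anneal
                   this toward 1.0 as training progresses.
    """
    centroid = node_features.mean(dim=0)
    difficulty = (node_features - centroid).norm(dim=-1)   # (N,)
    k = max(int(keep_ratio * difficulty.numel()), 1)
    mask = torch.zeros(difficulty.numel(), dtype=torch.bool)
    mask[difficulty.argsort()[:k]] = True                  # easiest k nodes
    return mask
```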
arXiv Detail & Related papers (2022-11-28T09:47:46Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- The Little W-Net That Could: State-of-the-Art Retinal Vessel Segmentation with Minimalistic Models [19.089445797922316]
We show that a minimalistic version of a standard U-Net with several orders of magnitude fewer parameters closely approximates the performance of current best techniques.
We also propose a simple extension, dubbed W-Net, which reaches outstanding performance on several popular datasets.
We also test our approach on the Artery/Vein segmentation problem, where we again achieve results well-aligned with the state-of-the-art.
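A toy PyTorch rendition of the cascaded idea, assuming W-Net stacks two minimal U-Nets with the second refining the first's prediction; the single downsampling stage and channel widths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A deliberately small U-Net with one downsampling stage (sketch)."""

    def __init__(self, in_ch: int, out_ch: int, width: int = 8):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2),
                                  nn.Conv2d(width, width * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(width * 2, width, 2, stride=2)
        self.dec = nn.Conv2d(width * 2, out_ch, 3, padding=1)

    def forward(self, x):  # H and W must be even for the skip to align
        e = self.enc(x)
        d = self.up(self.down(e))
        return self.dec(torch.cat([e, d], dim=1))  # skip connection

class WNet(nn.Module):
    """Two cascaded minimal U-Nets; the second refines the first's output."""

    def __init__(self, in_ch: int = 3, out_ch: int = 1):
        super().__init__()
        self.first = TinyUNet(in_ch, out_ch)
        self.second = TinyUNet(in_ch + out_ch, out_ch)

    def forward(self, x):
        coarse = self.first(x)
        return self.second(torch.cat([x, coarse], dim=1))
```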
arXiv Detail & Related papers (2020-09-03T19:59:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.