Do You Guys Want to Dance: Zero-Shot Compositional Human Dance
Generation with Multiple Persons
- URL: http://arxiv.org/abs/2401.13363v1
- Date: Wed, 24 Jan 2024 10:44:16 GMT
- Title: Do You Guys Want to Dance: Zero-Shot Compositional Human Dance
Generation with Multiple Persons
- Authors: Zhe Xu, Kun Wei, Xu Yang, Cheng Deng
- Abstract summary: We introduce a new task, dataset, and evaluation protocol of compositional human dance generation (cHDG).
We propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses.
- Score: 73.21855272778616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human dance generation (HDG) aims to synthesize realistic videos from images
and sequences of driving poses. Despite great success, existing methods are
limited to generating videos of a single person with specific backgrounds,
while the generalizability for real-world scenarios with multiple persons and
complex backgrounds remains unclear. To systematically measure the
generalizability of HDG models, we introduce a new task, dataset, and
evaluation protocol of compositional human dance generation (cHDG). Evaluating
the state-of-the-art methods on cHDG, we empirically find that they fail to
generalize to real-world scenarios. To tackle the issue, we propose a novel
zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos
consistent with arbitrary multiple persons and background while precisely
following the driving poses. Specifically, in contrast to straightforward DDIM
or null-text inversion, we first present a pose-aware inversion method to
obtain the noisy latent code and initialization text embeddings, which can
accurately reconstruct the composed reference image. Since directly generating
videos from them will lead to severe appearance inconsistency, we propose a
compositional augmentation strategy to generate augmented images and utilize
them to optimize a set of generalizable text embeddings. In addition, we design
a consistency-guided sampling strategy that encourages the background and
keypoints of the estimated clean image at each reverse step to stay close to
those of the reference image, further improving the temporal consistency of the
generated videos. Extensive qualitative and quantitative results demonstrate
the effectiveness and superiority of our approach.
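The consistency-guided sampling step described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the single background-consistency term, and the fixed guidance scale are assumptions, and the paper additionally constrains pose keypoints, which is omitted here.

```python
import numpy as np

def estimate_clean(x_t, eps_pred, alpha_bar_t):
    # Standard DDIM-style estimate of the clean image x0 from the noisy
    # latent x_t and the predicted noise eps_pred.
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def consistency_guided_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev,
                            ref_image, bg_mask, guidance_scale=0.5):
    """One deterministic reverse step with a background-consistency correction.

    ref_image : composed reference frame (same shape as x_t)
    bg_mask   : 1 where a pixel belongs to the background, 0 on the persons
    """
    x0_hat = estimate_clean(x_t, eps_pred, alpha_bar_t)
    # Pull the estimated clean image toward the reference in background
    # regions only; foreground (the dancing persons) is left untouched.
    x0_hat = x0_hat - guidance_scale * bg_mask * (x0_hat - ref_image)
    # Deterministic DDIM update using the corrected clean-image estimate.
    return (np.sqrt(alpha_bar_prev) * x0_hat
            + np.sqrt(1.0 - alpha_bar_prev) * eps_pred)
```

With `guidance_scale=1.0` and a full background mask, the corrected estimate collapses exactly onto the reference background, which is the degenerate limit of the guidance term.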
Related papers
- GAS: Generative Avatar Synthesis from a Single Image [54.95198111659466]
We introduce a generalizable and unified framework to synthesize view-consistent and temporally coherent avatars from a single image.
Our approach combines the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model.
arXiv Detail & Related papers (2025-02-10T19:00:39Z) - Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos [12.19207713016543]
Recent dynamic neural field setups assume input from multi-view videos with known poses.
We show that unsynchronized videos with unknown poses can yield dynamic neural fields when the capture is stabilized.
arXiv Detail & Related papers (2024-12-26T07:04:20Z) - CFSynthesis: Controllable and Free-view 3D Human Video Synthesis [57.561237409603066]
CFSynthesis is a novel framework for generating high-quality human videos with customizable attributes.
Our method leverages a texture-SMPL-based representation to ensure consistent and stable character appearances across free viewpoints.
Results on multiple datasets show that CFSynthesis achieves state-of-the-art performance in complex human animations.
arXiv Detail & Related papers (2024-12-15T05:57:36Z) - Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE [22.072200443502457]
We propose Scene123, a 3D scene generation model that ensures realism and diversity through the video generation framework.
Specifically, we warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model.
To further enhance the details and texture fidelity of generated views, we employ a GAN-based loss on images derived from the input image via the video generation model.
arXiv Detail & Related papers (2024-08-10T08:09:57Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential in computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information and is not responsible for any consequences of its use.