Do You Guys Want to Dance: Zero-Shot Compositional Human Dance
Generation with Multiple Persons
- URL: http://arxiv.org/abs/2401.13363v1
- Date: Wed, 24 Jan 2024 10:44:16 GMT
- Title: Do You Guys Want to Dance: Zero-Shot Compositional Human Dance
Generation with Multiple Persons
- Authors: Zhe Xu, Kun Wei, Xu Yang, Cheng Deng
- Abstract summary: We introduce a new task, dataset, and evaluation protocol of compositional human dance generation (cHDG).
We propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses.
- Score: 73.21855272778616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human dance generation (HDG) aims to synthesize realistic videos from images
and sequences of driving poses. Despite great success, existing methods are
limited to generating videos of a single person with specific backgrounds,
while the generalizability for real-world scenarios with multiple persons and
complex backgrounds remains unclear. To systematically measure the
generalizability of HDG models, we introduce a new task, dataset, and
evaluation protocol of compositional human dance generation (cHDG). Evaluating
the state-of-the-art methods on cHDG, we empirically find that they fail to
generalize to real-world scenarios. To tackle the issue, we propose a novel
zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos
consistent with arbitrary multiple persons and background while precisely
following the driving poses. Specifically, in contrast to straightforward DDIM
or null-text inversion, we first present a pose-aware inversion method to
obtain the noisy latent code and initialization text embeddings, which can
accurately reconstruct the composed reference image. Since directly generating
videos from them will lead to severe appearance inconsistency, we propose a
compositional augmentation strategy to generate augmented images and utilize
them to optimize a set of generalizable text embeddings. In addition,
consistency-guided sampling is elaborated to encourage the background and
keypoints of the estimated clean image at each reverse step to be close to
those of the reference image, further improving the temporal consistency of
generated videos. Extensive qualitative and quantitative results demonstrate
the effectiveness and superiority of our approach.
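To make the consistency-guided sampling idea concrete, below is a minimal, illustrative sketch (not the authors' released code) of one DDIM reverse step in which the estimated clean image is nudged toward the reference image's background and keypoints via a gradient-based correction. The names `unet`, `keypoint_loss`, `bg_mask`, and the noise schedule tensor `alpha_bar` are assumed placeholders for this sketch.

```python
import torch

def consistency_guided_ddim_step(x_t, t, t_prev, unet, alpha_bar,
                                 x_ref, bg_mask, keypoint_loss,
                                 guidance_scale=1.0, kp_weight=0.1):
    """One DDIM reverse step with background/keypoint consistency guidance (sketch)."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]

    with torch.enable_grad():
        x_t = x_t.detach().requires_grad_(True)
        eps = unet(x_t, t)                                     # predicted noise
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # estimated clean image

        # Consistency terms: keep the background and the keypoints of the
        # estimated clean image close to those of the composed reference image.
        bg_term = ((bg_mask * (x0_hat - x_ref)) ** 2).mean()
        kp_term = keypoint_loss(x0_hat, x_ref)                 # hypothetical differentiable pose loss
        grad = torch.autograd.grad(bg_term + kp_weight * kp_term, x_t)[0]

    # Standard deterministic DDIM update, then correct the latent against the
    # consistency gradient (classifier-guidance-style).
    x_prev = a_prev.sqrt() * x0_hat.detach() + (1 - a_prev).sqrt() * eps.detach()
    return x_prev - guidance_scale * grad
```

The correction follows the familiar guided-sampling pattern: the guidance term only perturbs the sampling trajectory, so the base diffusion model itself requires no retraining, which is consistent with the zero-shot setting described in the abstract.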
Related papers
- Scene123: One Prompt to 3D Scene Generation via Video-Assisted and Consistency-Enhanced MAE [22.072200443502457]
We propose Scene123, a 3D scene generation model that ensures realism and diversity through the video generation framework.
Specifically, we warp the input image (or an image generated from text) to simulate adjacent views, filling the invisible areas with the MAE model.
To further enhance the details and texture fidelity of the generated views, we apply a GAN-based loss between the generated views and images derived from the input image through the video generation model.
arXiv Detail & Related papers (2024-08-10T08:09:57Z) - MultiDiff: Consistent Novel View Synthesis from a Single Image [60.04215655745264]
MultiDiff is a novel approach for consistent novel view synthesis of scenes from a single RGB image.
Our results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet.
arXiv Detail & Related papers (2024-06-26T17:53:51Z) - MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild [32.6521941706907]
We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.
We first define a layered neural representation for the entire scene, composited by individual human and background models.
We learn the layered neural representation from videos via our layer-wise differentiable volume rendering.
arXiv Detail & Related papers (2024-06-03T17:59:57Z) - VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation [79.99551055245071]
We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
arXiv Detail & Related papers (2024-05-28T13:18:32Z) - Real-Time Neural Character Rendering with Pose-Guided Multiplane Images [75.62730144924566]
We propose pose-guided multiplane image (MPI) synthesis which can render an animatable character in real scenes with photorealistic quality.
We use a portable camera rig to capture the multi-view images along with the driving signal for the moving subject.
arXiv Detail & Related papers (2022-04-25T17:51:38Z) - Human View Synthesis using a Single Sparse RGB-D Input [16.764379184593256]
We present a novel view synthesis framework to generate realistic renders from unseen views of any human captured from a single-view sensor with sparse RGB-D.
An enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details.
arXiv Detail & Related papers (2021-12-27T20:13:53Z) - A Shared Representation for Photorealistic Driving Simulators [83.5985178314263]
We propose to improve the quality of generated images by rethinking the discriminator architecture.
The focus is on the class of problems where images are generated given semantic inputs, such as scene segmentation maps or human body poses.
We aim to learn a shared latent representation that encodes enough information to jointly perform semantic segmentation, content reconstruction, and coarse-to-fine adversarial reasoning.
arXiv Detail & Related papers (2021-12-09T18:59:21Z) - Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person has great potential in computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)