VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation
- URL: http://arxiv.org/abs/2405.18156v1
- Date: Tue, 28 May 2024 13:18:32 GMT
- Title: VividPose: Advancing Stable Video Diffusion for Realistic Human Image Animation
- Authors: Qilin Wang, Zhengkai Jiang, Chengming Xu, Jiangning Zhang, Yabiao Wang, Xinyi Zhang, Yun Cao, Weijian Cao, Chengjie Wang, Yanwei Fu
- Abstract summary: We propose VividPose, an end-to-end pipeline that ensures superior temporal stability.
An identity-aware appearance controller integrates additional facial information without compromising other appearance details.
A geometry-aware pose controller utilizes both dense rendering maps from SMPL-X and sparse skeleton maps.
VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human image animation involves generating a video from a static image by following a specified pose sequence. Current approaches typically adopt a multi-stage pipeline that separately learns appearance and motion, which often leads to appearance degradation and temporal inconsistencies. To address these issues, we propose VividPose, an innovative end-to-end pipeline based on Stable Video Diffusion (SVD) that ensures superior temporal stability. To enhance the retention of human identity, we propose an identity-aware appearance controller that integrates additional facial information without compromising other appearance details such as clothing texture and background. This approach ensures that the generated videos maintain high fidelity to the identity of the human subject, preserving key facial features across various poses. To accommodate diverse human body shapes and hand movements, we introduce a geometry-aware pose controller that utilizes both dense rendering maps from SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and shape in the generated videos, providing a robust framework capable of handling a wide range of body shapes and dynamic hand movements. Extensive qualitative and quantitative experiments on the UBCFashion and TikTok benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, VividPose exhibits superior generalization capabilities on our proposed in-the-wild dataset. Code and models will be made available.
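To make the geometry-aware pose conditioning described in the abstract more concrete, the sketch below shows one plausible way to encode dense SMPL-X rendering maps and sparse skeleton maps into a per-frame residual added to the video latents of a diffusion backbone. This is a minimal illustrative sketch, not the authors' released code: the module name, channel counts, 3-channel input assumption, and zero-initialized projection are all assumptions.

```python
# Hypothetical geometry-aware pose conditioning module in the spirit of the
# abstract: a dense branch (SMPL-X renderings) and a sparse branch (skeleton
# maps) are encoded separately, fused, and projected to the latent channel
# count. All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class GeometryAwarePoseEncoder(nn.Module):
    """Encodes dense SMPL-X renderings and sparse skeleton maps into a
    per-frame conditioning feature at the latent resolution (1/8 scale)."""

    def __init__(self, latent_channels: int = 4, hidden: int = 64):
        super().__init__()
        # Dense branch: SMPL-X rendering maps, assumed 3 channels per frame.
        self.dense_branch = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Sparse branch: rendered skeleton keypoint maps, assumed 3 channels.
        self.sparse_branch = nn.Sequential(
            nn.Conv2d(3, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1), nn.SiLU(),
        )
        # Zero-initialized projection so the new condition starts as a no-op,
        # a common trick when attaching controllers to a pretrained diffusion model.
        self.proj = nn.Conv2d(2 * hidden, latent_channels, 1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, dense_maps: torch.Tensor, skeleton_maps: torch.Tensor) -> torch.Tensor:
        # dense_maps, skeleton_maps: (batch * frames, 3, H, W)
        fused = torch.cat(
            [self.dense_branch(dense_maps), self.sparse_branch(skeleton_maps)],
            dim=1,
        )
        # Residual to be added to the video latents before the denoising UNet.
        return self.proj(fused)


if __name__ == "__main__":
    encoder = GeometryAwarePoseEncoder()
    dense = torch.randn(2, 3, 512, 512)     # per-frame SMPL-X rendering
    skeleton = torch.randn(2, 3, 512, 512)  # per-frame sparse skeleton map
    pose_residual = encoder(dense, skeleton)
    print(pose_residual.shape)  # torch.Size([2, 4, 64, 64])
```

The zero-initialized output projection is a design choice borrowed from controller-style conditioning schemes; it lets training start from the unmodified pretrained behavior and gradually learn the pose signal.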
Related papers
- MIMAFace: Face Animation via Motion-Identity Modulated Appearance Feature Learning [30.61146302275139]
We introduce a Motion-Identity Modulated Appearance Learning Module (MIA) that modulates CLIP features at both motion and identity levels.
We also design an Inter-clip Affinity Learning Module (ICA) to model temporal relationships across clips.
Our method achieves precise facial motion control (i.e., expressions and gaze), faithful identity preservation, and generates animation videos that maintain both intra/inter-clip temporal consistency.
arXiv Detail & Related papers (2024-09-23T16:33:53Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - Follow-Your-Pose v2: Multiple-Condition Guided Character Image Animation for Stable Pose Control [77.08568533331206]
Follow-Your-Pose v2 can be trained on noisy open-sourced videos readily available on the internet.
Our approach outperforms state-of-the-art methods by a margin of over 35% across 2 datasets and on 7 metrics.
arXiv Detail & Related papers (2024-06-05T08:03:18Z) - AniDress: Animatable Loose-Dressed Avatar from Sparse Views Using Garment Rigging Model [58.035758145894846]
We introduce AniDress, a novel method for generating animatable human avatars in loose clothes using very sparse multi-view videos.
A pose-driven deformable neural radiance field conditioned on both body and garment motions is introduced, providing explicit control of both parts.
Our method renders natural garment dynamics that deviate strongly from the body and generalizes well to both unseen views and poses.
arXiv Detail & Related papers (2024-01-27T08:48:18Z) - Do You Guys Want to Dance: Zero-Shot Compositional Human Dance Generation with Multiple Persons [73.21855272778616]
We introduce a new task, dataset, and evaluation protocol for compositional human dance generation (cHDG).
We propose a novel zero-shot framework, dubbed MultiDance-Zero, that can synthesize videos consistent with arbitrary multiple persons and background while precisely following the driving poses.
arXiv Detail & Related papers (2024-01-24T10:44:16Z) - Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z) - VINECS: Video-based Neural Character Skinning [82.39776643541383]
We propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights.
We show that our approach outperforms state-of-the-art while not relying on dense 4D scans.
arXiv Detail & Related papers (2023-07-03T08:35:53Z) - Video-driven Neural Physically-based Facial Asset for Production [33.24654834163312]
We present a new learning-based, video-driven approach for generating dynamic facial geometries with high-quality physically-based assets.
Our technique provides higher accuracy and visual fidelity than previous video-driven facial reconstruction and animation methods.
arXiv Detail & Related papers (2022-02-11T13:22:48Z)