HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
- URL: http://arxiv.org/abs/2407.17438v3
- Date: Thu, 21 Nov 2024 03:26:54 GMT
- Title: HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
- Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
- Abstract summary: We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of real-world videos from the internet.
For the synthetic data, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothing.
- Score: 64.37874983401221
- License:
- Abstract: Human image animation involves generating videos from a character photo, allowing user control and unlocking the potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of real-world videos from the internet. We developed and applied careful filtering rules to ensure video quality, resulting in a curated collection of 20K high-resolution (1080P) human-centric videos. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. To expand our synthetic dataset, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothing. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such a simple baseline trained on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Demo, data and code can be found on the project website: https://humanvid.github.io/.
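The abstract describes the rule-based camera trajectory generation only at a high level. Below is a minimal sketch of what one such rule could look like, emitting per-frame world-to-camera extrinsics for a linear pan-plus-dolly move; the function name, the specific rule, and all parameters are illustrative assumptions rather than the paper's actual pipeline.

```python
# Minimal sketch of a rule-based camera trajectory generator (illustrative only;
# the actual rules used in HumanVid are not specified in the abstract).
import numpy as np

def pan_dolly_trajectory(num_frames: int,
                         pan_degrees: float = 15.0,
                         dolly_meters: float = 0.5) -> np.ndarray:
    """Return (num_frames, 4, 4) world-to-camera extrinsic matrices.

    The camera rotates about the vertical (y) axis by `pan_degrees` in total
    and translates `dolly_meters` along its viewing (z) axis, both linearly
    over the clip. Frame 0 is the identity pose.
    """
    poses = np.zeros((num_frames, 4, 4))
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)           # progress in [0, 1]
        theta = np.deg2rad(pan_degrees) * alpha      # yaw angle so far
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[ c, 0.0,   s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0,   c]])               # rotation about y
        tvec = np.array([0.0, 0.0, dolly_meters * alpha])  # forward dolly
        poses[t, :3, :3] = R
        poses[t, :3, 3] = tvec
        poses[t, 3, 3] = 1.0
    return poses

# Example: a 30-frame clip with a gentle pan and dolly-in.
extrinsics = pan_dolly_trajectory(30)
print(extrinsics.shape)  # (30, 4, 4)
```

A generator of this kind can attach exact camera extrinsics to every rendered frame of a synthetic clip, which is what gives the synthetic half of the dataset the precise camera motion annotation that is hard to obtain from real-world footage.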
Related papers
- AvatarGO: Zero-shot 4D Human-Object Interaction Generation and Animation [60.5897687447003]
AvatarGO is a novel framework designed to generate realistic 4D HOI scenes from textual inputs.
Our framework not only generates coherent compositional motions, but also exhibits greater robustness in handling issues.
As the first attempt to synthesize 4D avatars with object interactions, we hope AvatarGO could open new doors for human-centric 4D content creation.
arXiv Detail & Related papers (2024-10-09T17:58:56Z)
- AMG: Avatar Motion Guided Video Generation [5.82136706118236]
We propose AMG, a method that combines the 2D photorealism and 3D controllability by conditioning video diffusion models on controlled rendering of 3D avatars.
AMG is the first method that enables multi-person diffusion video generation with precise control over camera positions, human motions, and background style.
arXiv Detail & Related papers (2024-09-02T23:59:01Z)
- Decoupling Human and Camera Motion from Videos in the Wild [67.39432972193929]
We propose a method to reconstruct global human trajectories from videos in the wild.
Our method decouples the camera and human motion, which allows us to place people in the same world coordinate frame.
arXiv Detail & Related papers (2023-02-24T18:59:15Z)
- QS-Craft: Learning to Quantize, Scrabble and Craft for Conditional Human Motion Animation [66.97112599818507]
This paper studies the task of conditional Human Motion Animation (cHMA)
Given a source image and a driving video, the model should synthesize a new frame sequence that animates the source image according to the driving motion.
The key novelties come from the newly introduced three key steps: quantize, scrabble and craft.
arXiv Detail & Related papers (2022-03-22T11:34:40Z)
- Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on the in-the-wild human video dataset 3DPW.
arXiv Detail & Related papers (2021-11-29T16:32:41Z)
- Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories.
The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearance.
Action2motion generates plausible 3D pose sequences of a prescribed action category, which are then processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or their summaries and is not responsible for any consequences arising from their use.