HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
- URL: http://arxiv.org/abs/2503.24026v2
- Date: Tue, 01 Apr 2025 03:43:35 GMT
- Title: HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation
- Authors: Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang, Zheng Zhu, Muyang Zhang, Yukun Zhou, Xinze Chen, Guan Huang, Lihong Liu, Xingang Wang
- Abstract summary: We propose a decoupled human video generation framework that first generates diverse poses from text prompts. We present MotionDiT, which is trained to generate structured human-motion poses from text prompts. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-motion video generation has been a challenging task, primarily due to the difficulty inherent in learning human body movements. While some approaches have attempted to drive human-centric video generation explicitly through pose control, these methods typically rely on poses derived from existing videos, thereby lacking flexibility. To address this, we propose HumanDreamer, a decoupled human video generation framework that first generates diverse poses from text prompts and then leverages these poses to generate human-motion videos. Specifically, we propose MotionVid, the largest dataset for human-motion pose generation. Based on this dataset, we present MotionDiT, which is trained to generate structured human-motion poses from text prompts. In addition, a novel LAMA loss is introduced; together, these components yield a 62.4% improvement in FID, along with respective gains in top-1, top-2, and top-3 R-precision of 41.8%, 26.3%, and 18.3%, advancing both Text-to-Pose control accuracy and FID. Our experiments across various Pose-to-Video baselines demonstrate that the poses generated by our method can produce diverse and high-quality human-motion videos. Furthermore, our model can facilitate other downstream tasks, such as pose sequence prediction and 2D-to-3D motion lifting.
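The abstract gives no implementation details, but the two-stage decoupling it describes can be sketched. Below is a minimal, hypothetical Python sketch of a text-to-pose stage followed by an independent pose-to-video stage; the class names, tensor shapes, joint count, and dummy denoising loop are all illustrative assumptions, not the authors' actual MotionDiT, MotionVid, or LAMA loss.

```python
# Illustrative sketch of the decoupled text -> pose -> video pipeline.
# All names and shapes here are hypothetical stand-ins.
import numpy as np

class TextToPoseModel:
    """Stand-in for a MotionDiT-style text-to-pose diffusion model."""
    def __init__(self, num_joints=18, num_frames=64, num_steps=50):
        self.num_joints = num_joints   # 2D keypoints per frame (assumed)
        self.num_frames = num_frames
        self.num_steps = num_steps

    def denoise(self, poses, text_embedding, step):
        # Placeholder for one reverse-diffusion step conditioned on text.
        return poses * 0.98  # dummy update, not a real denoiser

    def sample(self, text_embedding):
        # Start from Gaussian noise over the whole pose sequence and
        # iteratively denoise, as in standard diffusion sampling.
        poses = np.random.randn(self.num_frames, self.num_joints, 2)
        for step in reversed(range(self.num_steps)):
            poses = self.denoise(poses, text_embedding, step)
        return poses

class PoseToVideoModel:
    """Stand-in for any off-the-shelf pose-to-video generator."""
    def render(self, poses, reference_image):
        # A real model would synthesize RGB frames that follow the pose
        # sequence while preserving the reference appearance.
        num_frames = poses.shape[0]
        return np.zeros((num_frames, 256, 256, 3), dtype=np.uint8)

def generate_human_motion_video(prompt, reference_image):
    # Stage 1: text -> structured pose sequence.
    text_embedding = np.random.randn(512)  # stand-in for a text encoder
    poses = TextToPoseModel().sample(text_embedding)
    # Stage 2: poses -> video, fully decoupled from stage 1.
    return PoseToVideoModel().render(poses, reference_image)

frames = generate_human_motion_video(
    "a person waves and walks forward",
    reference_image=np.zeros((256, 256, 3), dtype=np.uint8),
)
print(frames.shape)  # (64, 256, 256, 3)
```

The point of this structure is that stage 1 (trained on a pose dataset such as MotionVid) and stage 2 (any Pose-to-Video baseline) are independently swappable, which is what the abstract's experiments across multiple Pose-to-Video baselines rely on.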
Related papers
- DirectorLLM for Human-Centric Video Generation [46.37441947526771]
We introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. Our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
arXiv Detail & Related papers (2024-12-19T03:10:26Z) - Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image. Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z) - Fleximo: Towards Flexible Text-to-Human Motion Video Generation [17.579663311741072]
We introduce a novel task aimed at generating human motion videos solely from reference images and natural language.
We propose a new framework called Fleximo, which leverages large-scale pre-trained text-to-3D motion models.
To assess the performance of Fleximo, we introduce a new benchmark called MotionBench, which includes 400 videos across 20 identities and 20 motions.
arXiv Detail & Related papers (2024-11-29T04:09:13Z) - OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset.
Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos.
Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z) - HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of real-world videos from the internet.
For the synthetic data, we collect 10K 3D avatar assets and leverage existing assets of body shapes, skin textures, and clothing.
arXiv Detail & Related papers (2024-07-24T17:15:58Z) - Text2Performer: Text-Driven Human Video Generation [97.3849869893433]
Text-driven content creation has evolved into a transformative technique that is reshaping creative work.
Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer.
In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
arXiv Detail & Related papers (2023-04-17T17:59:02Z) - Human Performance Capture from Monocular Video in the Wild [50.34917313325813]
We propose a method capable of capturing the dynamic 3D human shape from a monocular video featuring challenging body poses.
Our method outperforms state-of-the-art methods on 3DPW, an in-the-wild human video dataset.
arXiv Detail & Related papers (2021-11-29T16:32:41Z) - High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z) - Human Motion Transfer from Poses in the Wild [61.6016458288803]
We tackle the problem of human motion transfer, where we synthesize novel motion video for a target person that imitates the movement from a reference video.
It is a video-to-video translation task in which the estimated poses are used to bridge two domains.
We introduce a novel pose-to-video translation framework for generating high-quality videos that are temporally coherent even for in-the-wild pose sequences unseen during training.
arXiv Detail & Related papers (2020-04-07T05:59:53Z)