From Generated Human Videos to Physically Plausible Robot Trajectories
- URL: http://arxiv.org/abs/2512.05094v1
- Date: Thu, 04 Dec 2025 18:56:03 GMT
- Title: From Generated Human Videos to Physically Plausible Robot Trajectories
- Authors: James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig
- Abstract summary: Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts. To realize this potential, how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. We propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards.
- Score: 103.28274349461607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget it to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions from noisy, generated videos. We curate GenMimicBench, a synthetic human-motion dataset generated using two video generation models across a spectrum of actions and contexts, establishing a benchmark for assessing zero-shot generalization and policy robustness. Extensive experiments demonstrate improvements over strong baselines in simulation and confirm coherent, physically stable motion tracking on a Unitree G1 humanoid robot without fine-tuning. This work offers a promising path to realizing the potential of video generation models as high-level policies for robot control.
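The abstract names two training ingredients for GenMimic: keypoint-weighted tracking rewards and symmetry regularization. The exact formulations are not given here, so the following is only a minimal NumPy sketch of what such terms can look like; the functional forms, weighting scheme, and names are assumptions, not the paper's implementation.

```python
import numpy as np

def keypoint_tracking_reward(robot_kpts, ref_kpts, weights, sigma=0.1):
    """Keypoint-weighted tracking reward (hypothetical form).

    robot_kpts, ref_kpts: (K, 3) arrays of 3D keypoint positions.
    weights: (K,) importance per keypoint (e.g. larger for hands and feet).
    Returns a reward in (0, 1] that decays with the weighted tracking error.
    """
    err = np.linalg.norm(robot_kpts - ref_kpts, axis=-1)    # (K,) per-keypoint error
    weighted_err = np.sum(weights * err) / np.sum(weights)  # scalar weighted mean
    return float(np.exp(-(weighted_err / sigma) ** 2))

def symmetry_regularizer(policy, obs, mirror_obs, mirror_act):
    """Mirror-symmetry penalty (hypothetical form): the action taken in a
    left/right-mirrored observation should be the mirror of the original
    action. mirror_obs / mirror_act swap symmetric joints and flip the
    lateral axis; their exact definitions depend on the robot."""
    a = policy(obs)
    a_mirrored = policy(mirror_obs(obs))
    return float(np.mean((mirror_act(a) - a_mirrored) ** 2))
```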
Related papers
- H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos [58.006918399913665]
We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions.
arXiv Detail & Related papers (2025-12-10T07:59:45Z)
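The H2R-Grounder summary above describes a test-time step of inpainting the person out of the frame and overlaying human pose cues before translation. As a rough single-frame illustration only (the paper's actual inpainting model and pose format are not described here), that step might look like the OpenCV sketch below; the skeleton indices and drawing details are assumptions.

```python
import cv2
import numpy as np

# Hypothetical skeleton edges over 2D keypoints; the indices are assumptions.
SKELETON_EDGES = [(5, 7), (7, 9), (6, 8), (8, 10), (5, 6), (11, 12),
                  (5, 11), (6, 12), (11, 13), (13, 15), (12, 14), (14, 16)]

def prepare_conditioning_frame(frame, person_mask, keypoints_2d):
    """Remove the human and overlay pose cues on a single frame.

    frame: HxWx3 BGR image; person_mask: HxW uint8 mask of the person;
    keypoints_2d: (K, 2) pixel coordinates. The real system presumably uses
    a learned video inpainter; cv2.inpaint is only a stand-in here.
    """
    # 1) Inpaint the person region so the background is person-free.
    background = cv2.inpaint(frame, person_mask, inpaintRadius=5,
                             flags=cv2.INPAINT_TELEA)
    # 2) Overlay the human pose as a simple stick figure.
    overlay = background.copy()
    for i, j in SKELETON_EDGES:
        p = tuple(map(int, keypoints_2d[i]))
        q = tuple(map(int, keypoints_2d[j]))
        cv2.line(overlay, p, q, color=(0, 255, 0), thickness=2)
    for x, y in keypoints_2d.astype(int):
        cv2.circle(overlay, (int(x), int(y)), radius=3, color=(0, 0, 255), thickness=-1)
    return overlay
```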
- X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale [59.36026074638773]
We introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames.
arXiv Detail & Related papers (2025-12-04T07:34:08Z)
- Robot Learning from a Physical World Model [33.89964002945721]
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches.
arXiv Detail & Related papers (2025-11-10T18:59:07Z)
- MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training [40.45924128424013]
We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
arXiv Detail & Related papers (2025-09-26T11:05:10Z)
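The MimicDreamer summary mentions that EgoStabilizer canonicalizes egocentric videos via homography. A generic homography-based stabilization of one frame against a reference view can be sketched with OpenCV as below; the feature detector, matching strategy, and thresholds are assumptions, not the actual EgoStabilizer procedure.

```python
import cv2
import numpy as np

def stabilize_to_reference(ref_gray, frame_gray, frame_bgr):
    """Warp frame_bgr into the reference view with an estimated homography.

    A generic sketch: ORB features + RANSAC homography. EgoStabilizer's
    actual procedure is not specified in the summary above.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(ref_gray, None)
    kp2, des2 = orb.detectAndCompute(frame_gray, None)

    # Match current-frame descriptors against the reference frame.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des2, des1), key=lambda m: m.distance)[:200]

    src = np.float32([kp2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

    h, w = ref_gray.shape[:2]
    return cv2.warpPerspective(frame_bgr, H, (w, h))
```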
- DreamGen: Unlocking Generalization in Robot Learning through Video World Models [120.25799361925387]
DreamGen is a pipeline for training robot policies that generalize across behaviors and environments through neural trajectories. Our work establishes a promising new axis for scaling robot learning well beyond manual data collection.
arXiv Detail & Related papers (2025-05-19T04:55:39Z)
- VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation [74.70013315714336]
Gen2Act casts language-conditioned manipulation as zero-shot human video generation followed by execution with a single policy conditioned on the generated video.
Our results on diverse real-world scenarios show how Gen2Act enables manipulating unseen object types and performing novel motions for tasks not present in the robot data.
arXiv Detail & Related papers (2024-09-24T17:57:33Z)
- Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z)
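Vid2Robot is described above as a video-conditioned policy built on cross-attention transformers. As a minimal sketch of that conditioning pattern only (the real architecture, token design, and dimensions are not specified here and are assumptions), robot state tokens can attend to prompt-video tokens via cross-attention:

```python
import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    """Toy video-conditioned policy: robot state tokens attend to prompt-video
    tokens with cross-attention. Dimensions, depth, and the action head are
    assumptions, not Vid2Robot's actual architecture."""

    def __init__(self, d_model=256, n_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.action_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, robot_tokens, video_tokens):
        # robot_tokens: (B, T_r, d) current-observation tokens (queries)
        # video_tokens: (B, T_v, d) prompt-video tokens (keys/values)
        attended, _ = self.cross_attn(robot_tokens, video_tokens, video_tokens)
        fused = self.norm(robot_tokens + attended)   # residual + norm
        return self.action_head(fused.mean(dim=1))   # (B, action_dim)

# Example: batch of 2, 16 robot tokens, 64 prompt-video tokens, 256-dim features.
policy = VideoConditionedPolicy()
actions = policy(torch.randn(2, 16, 256), torch.randn(2, 64, 256))
```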