MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training
- URL: http://arxiv.org/abs/2509.22199v2
- Date: Mon, 29 Sep 2025 05:03:33 GMT
- Title: MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training
- Authors: Haoyun Li, Ivan Zhang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Zhiqin Yang, Zhentao Zhang, Boyuan Wang, Chaojun Ni, Wenkang Qin, Xinze Chen, Yun Ye, Guan Huang, Zhenbo Song, Xingang Wang
- Abstract summary: We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
- Score: 40.45924128424013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Action (VLA) models derive their generalization capability from diverse training data, yet collecting embodied robot interaction data remains prohibitively expensive. In contrast, human demonstration videos are far more scalable and cost-efficient to collect, and recent studies confirm their effectiveness in training VLA models. However, a significant domain gap persists between human videos and robot-executed videos, including unstable camera viewpoints, visual discrepancies between human hands and robotic arms, and differences in motion dynamics. To bridge this gap, we propose MimicDreamer, a framework that turns fast, low-cost human demonstrations into robot-usable supervision by jointly aligning vision, viewpoint, and actions to directly support policy training. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos by transferring motion from human manipulation footage. For viewpoint stabilization, EgoStabilizer is proposed, which canonicalizes egocentric videos via homography and inpaints occlusions and distortions caused by warping. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver to produce feasible, low-jitter joint commands with accurate pose tracking. Empirically, VLA models trained purely on our synthesized human-to-robot videos achieve few-shot execution on real robots. Moreover, scaling training with human data significantly boosts performance compared to models trained solely on real robot data; our approach improves the average success rate by 14.7% across six representative manipulation tasks.
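The abstract describes EgoStabilizer only at a high level. As a rough, non-authoritative sketch of what homography-based canonicalization of an egocentric clip can look like, the snippet below tracks features from a reference frame, fits a per-frame homography with RANSAC, warps each frame into the reference view, and fills the empty regions left by the warp. The function name `stabilize_to_reference` and all parameters are illustrative assumptions; the paper's model inpaints occlusions and warping artifacts with a learned component rather than the classical inpainting used here.

```python
import cv2
import numpy as np

def stabilize_to_reference(frames):
    """Warp every frame of an egocentric clip into the first frame's view.

    A rough stand-in for homography-based canonicalization: track sparse
    features with optical flow, fit a per-frame homography with RANSAC,
    warp the frame, and fill the regions the warp leaves empty.
    """
    ref = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    ref_pts = cv2.goodFeaturesToTrack(ref, maxCorners=500,
                                      qualityLevel=0.01, minDistance=8)
    h, w = ref.shape
    stabilized = [frames[0]]

    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Track the reference features into the current frame.
        cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(ref, gray, ref_pts, None)
        good = status.ravel() == 1
        if good.sum() < 4:            # not enough matches to fit a homography
            stabilized.append(frame)
            continue
        # Homography mapping the current frame back into the reference view.
        H, _ = cv2.findHomography(cur_pts[good], ref_pts[good], cv2.RANSAC, 3.0)
        warped = cv2.warpPerspective(frame, H, (w, h))
        # Pixels with no source after warping; a learned inpainter would fill
        # these, classical inpainting is used here only as a placeholder.
        empty = cv2.warpPerspective(np.ones((h, w), np.uint8), H, (w, h)) == 0
        warped = cv2.inpaint(warped, empty.astype(np.uint8) * 255, 3,
                             cv2.INPAINT_TELEA)
        stabilized.append(warped)
    return stabilized
```

Stabilizing every frame against a single reference rather than chaining frame-to-frame homographies avoids drift over long clips; how MimicDreamer defines its canonical viewpoint is not specified in the abstract.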
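Likewise, the constrained inverse-kinematics step for action alignment is only named, not specified. A minimal sketch of one standard way to obtain feasible, low-jitter joint commands from a wrist trajectory is a damped-least-squares IK update with a per-step motion cap and joint-limit clipping; the toy 2-link arm, limits, and gains below are assumptions for illustration, not the paper's solver.

```python
import numpy as np

# Toy 2-link planar arm standing in for the real robot kinematics.
LINKS = np.array([0.4, 0.3])                                   # link lengths [m]
Q_MIN, Q_MAX = np.array([-2.6, -2.6]), np.array([2.6, 2.6])    # joint limits [rad]

def fk(q):
    """End-effector (x, y) position of the toy 2-link arm."""
    x = LINKS[0] * np.cos(q[0]) + LINKS[1] * np.cos(q[0] + q[1])
    y = LINKS[0] * np.sin(q[0]) + LINKS[1] * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """2x2 position Jacobian of the toy arm."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-LINKS[0] * s1 - LINKS[1] * s12, -LINKS[1] * s12],
                     [ LINKS[0] * c1 + LINKS[1] * c12,  LINKS[1] * c12]])

def ik_step(q, target, damping=1e-2, max_step=0.05):
    """One damped-least-squares IK update toward a Cartesian target.

    Damping keeps the update well conditioned near singularities, the step
    cap limits per-frame joint motion (low jitter), and clipping enforces
    joint limits (feasibility).
    """
    err = target - fk(q)
    J = jacobian(q)
    dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
    dq = np.clip(dq, -max_step, max_step)
    return np.clip(q + dq, Q_MIN, Q_MAX)

def retarget(wrist_xy_trajectory, q0=np.array([0.3, 0.5]), iters=20):
    """Convert a wrist trajectory (already in the robot frame) to joint commands."""
    q, joint_traj = q0.copy(), []
    for target in wrist_xy_trajectory:
        for _ in range(iters):            # iterate IK at each waypoint
            q = ik_step(q, np.asarray(target, dtype=float))
        joint_traj.append(q.copy())
    return np.array(joint_traj)
```

For example, `retarget([[0.5, 0.2], [0.45, 0.25]])` returns a (2, 2) array of joint angles. On a real arm the forward kinematics and Jacobian would come from the robot model, and the human wrist poses would first be mapped into the robot's base frame as the abstract describes.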
Related papers
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves strongly improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z) - H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos [58.006918399913665]
We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions.
arXiv Detail & Related papers (2025-12-10T07:59:45Z) - Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos [42.86535655563404]
We develop a fully-automated holistic human activity analysis approach for arbitrary human hand videos. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. We design a dexterous hand VLA model architecture and pretrain the model on this dataset.
arXiv Detail & Related papers (2025-10-24T15:39:31Z) - AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning [5.371855090716962]
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations. Existing approaches have employed vision-language pretraining with large-scale data. We propose to learn from large-scale human action video datasets in an explicit way.
arXiv Detail & Related papers (2025-08-11T05:09:58Z) - EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos [49.820119587446655]
In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and convert the human actions to robot actions. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations.
arXiv Detail & Related papers (2025-07-16T17:27:44Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)