EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
- URL: http://arxiv.org/abs/2507.12440v3
- Date: Fri, 18 Jul 2025 07:18:39 GMT
- Title: EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
- Authors: Ruihan Yang, Qinxi Yu, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
- Abstract summary: In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and convert the human actions to robot actions. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations.
- Score: 49.820119587446655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
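The human-to-robot conversion the abstract describes (predict the human wrist pose, then solve Inverse Kinematics to obtain robot joint angles) can be illustrated with a minimal numerical sketch. The 2-link planar arm, the damped-least-squares solver, and all names below are illustrative assumptions, not EgoVLA's actual kinematic chain or retargeting code:

```python
import numpy as np

# Toy stand-in for the paper's pipeline: treat a predicted human wrist
# position as the target for the robot end-effector and recover joint
# angles via numerical IK. The arm model is hypothetical.

L1, L2 = 0.3, 0.25  # link lengths (m) of the toy arm

def forward_kinematics(q):
    """End-effector (x, y) of a 2-link planar arm for joint angles q."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q):
    """Analytic 2x2 Jacobian d(x, y)/d(q0, q1)."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-L1 * s1 - L2 * s12, -L2 * s12],
        [ L1 * c1 + L2 * c12,  L2 * c12],
    ])

def solve_ik(target, q0, iters=200, damping=1e-2, max_step=0.3):
    """Damped-least-squares IK toward a predicted human wrist target."""
    q = np.asarray(q0, dtype=float).copy()
    for _ in range(iters):
        err = target - forward_kinematics(q)
        J = jacobian(q)
        # dq = J^T (J J^T + lambda I)^{-1} err, with a step-size clamp
        dq = J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), err)
        q += np.clip(dq, -max_step, max_step)
    return q

# Reachable target standing in for a predicted human wrist position.
wrist_target = np.array([0.35, 0.2])
q = solve_ik(wrist_target, q0=[0.3, 1.0])
print(np.round(forward_kinematics(q), 3))  # close to wrist_target
```

In the actual system a full 6-DoF wrist pose plus finger keypoints would be retargeted to the robot hand before the IK solve; the damped-least-squares update is just one common choice of numerical solver.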
Related papers
- MiVLA: Towards Generalizable Vision-Language-Action Model with Human-Robot Mutual Imitation Pre-training [102.850162490626]
We propose MiVLA, a vision-language-action model empowered by human-robot mutual imitation pre-training. We show that MiVLA achieves strongly improved generalization, outperforming state-of-the-art VLAs.
arXiv Detail & Related papers (2025-12-17T12:59:41Z) - H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos [58.006918399913665]
We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions.
arXiv Detail & Related papers (2025-12-10T07:59:45Z) - X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale [59.36026074638773]
We introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. We then apply our trained model to 60 hours of Ego-Exo4D videos, generating and releasing a new large-scale dataset of over 3.6 million "robotized" humanoid video frames.
arXiv Detail & Related papers (2025-12-04T07:34:08Z) - Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos [42.86535655563404]
We develop a fully-automated holistic human activity analysis approach for arbitrary human hand videos. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. We design a dexterous hand VLA model architecture and pretrain the model on this dataset.
arXiv Detail & Related papers (2025-10-24T15:39:31Z) - MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training [40.45924128424013]
We propose MimicDreamer, a framework that turns low-cost human demonstrations into robot-usable supervision. For visual alignment, we propose H2R Aligner, a video diffusion model that generates high-fidelity robot demonstration videos. For viewpoint stabilization, we propose EgoStabilizer, which canonicalizes egocentric videos via homography. For action alignment, we map human hand trajectories to the robot frame and apply a constrained inverse kinematics solver.
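The homography-based canonicalization attributed to EgoStabilizer can be sketched in a few lines: a 3x3 homography maps pixel coordinates from a shaky egocentric frame into a canonical view. The example matrix below (an in-plane roll plus a pixel shift) and the function name are made-up illustrations, not the homographies the paper estimates:

```python
import numpy as np

# Illustrative homography warp: map Nx2 pixel coordinates from one
# camera view into a canonical frame using homogeneous coordinates.

def apply_homography(H, pts):
    """Warp Nx2 pixel coordinates with a 3x3 homography H."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    warped = (H @ pts_h.T).T
    return warped[:, :2] / warped[:, 2:3]             # back to Cartesian

# Example: undo a 5-degree in-plane roll plus a (10, -4) pixel shift,
# as if re-aligning a tilted egocentric frame to the canonical view.
theta = np.deg2rad(5.0)
H = np.array([
    [np.cos(theta), -np.sin(theta), 10.0],
    [np.sin(theta),  np.cos(theta), -4.0],
    [0.0,            0.0,            1.0],
])
corners = np.array([[0.0, 0.0], [640.0, 0.0], [640.0, 480.0], [0.0, 480.0]])
print(np.round(apply_homography(H, corners), 1))
```

In practice the homography would be estimated per frame (e.g. from matched feature points) rather than specified by hand; this sketch only shows the warp itself.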
arXiv Detail & Related papers (2025-09-26T11:05:10Z) - AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning [5.371855090716962]
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations. Existing approaches have employed vision-language pretraining with large-scale data. We propose to learn from large-scale human action video datasets in an explicit way.
arXiv Detail & Related papers (2025-08-11T05:09:58Z) - EgoZero: Robot Learning from Smart Glasses [54.6168258133554]
EgoZero learns robust manipulation policies from human demonstrations captured with Project Aria smart glasses. We deploy EgoZero policies on a Franka Panda robot and demonstrate zero-shot transfer with a 70% success rate across 7 manipulation tasks. Our results suggest that in-the-wild human data can serve as a scalable foundation for real-world robot learning.
arXiv Detail & Related papers (2025-05-26T17:59:17Z) - Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration [21.94699075066712]
Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies.
arXiv Detail & Related papers (2025-04-17T03:15:20Z) - Humanoid Policy ~ Human Policy [26.01581047414598]
We train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency.
arXiv Detail & Related papers (2025-03-17T17:59:09Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - Visual IRL for Human-Like Robotic Manipulation [5.167226775583172]
We present a novel method for collaborative robots (cobots) to learn manipulation tasks and perform them in a human-like manner. Our method falls under the learn-from-observation (LfO) paradigm, where robots learn to perform tasks by observing human actions. We evaluate the performance of this approach on two different realistic manipulation tasks.
arXiv Detail & Related papers (2024-12-16T01:23:13Z) - EgoMimic: Scaling Imitation Learning via Egocentric Video [22.902881956495765]
We present EgoMimic, a full-stack framework which scales manipulation via human embodiment data.
EgoMimic achieves this through: (1) a system to capture human embodiment data using the ergonomic Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, and (3) an imitation learning architecture that co-trains on human and robot data.
arXiv Detail & Related papers (2024-10-31T17:59:55Z) - Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA). LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z) - Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z) - Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z) - Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z) - Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.