Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
- URL: http://arxiv.org/abs/2504.12609v3
- Date: Sat, 16 Aug 2025 04:32:53 GMT
- Title: Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
- Authors: Tyler Ga Wei Lum, Olivia Y. Lee, C. Karen Liu, Jeannette Bohg
- Abstract summary: We propose a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task.
Human2Sim2Robot outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks.
- Score: 21.94699075066712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation, a process that is challenging to scale. Videos of human-object interactions are easier to collect and scale, but leveraging them directly for robot learning is difficult due to the lack of explicit action labels and human-robot embodiment differences. We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies using only one RGB-D video of a human demonstrating a task. Our method utilizes reinforcement learning (RL) in simulation to cross the embodiment gap without relying on wearables, teleoperation, or large-scale data collection. From the video, we extract: (1) the object pose trajectory to define an object-centric, embodiment-agnostic reward, and (2) the pre-manipulation hand pose to initialize and guide exploration during RL training. These components enable effective policy learning without any task-specific reward tuning. In the single human demo regime, Human2Sim2Robot outperforms object-aware replay by over 55% and imitation learning by over 68% on grasping, non-prehensile manipulation, and multi-step tasks. Website: https://human2sim2robot.github.io
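The pipeline described in the abstract hinges on two quantities extracted from the single RGB-D video: the object pose trajectory, which defines an object-centric, embodiment-agnostic reward, and the pre-manipulation hand pose, which initializes and guides exploration during RL. The sketch below illustrates one way such a trajectory-tracking reward and episode reset could be wired into a simulator; the function names (`pose_distance`, `object_centric_reward`, `reset_episode`), the `sim` interface, and the exponential reward shape are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pose_distance(pose_a, pose_b, rot_weight=0.1):
    """Distance between two object poses given as (position, unit quaternion) pairs.

    Combines Euclidean position error with the quaternion geodesic angle;
    the relative weighting is an illustrative choice, not taken from the paper.
    """
    pos_err = np.linalg.norm(pose_a[0] - pose_b[0])
    dot = np.clip(abs(np.dot(pose_a[1], pose_b[1])), 0.0, 1.0)
    ang_err = 2.0 * np.arccos(dot)
    return pos_err + rot_weight * ang_err

def object_centric_reward(object_pose, reference_trajectory, t, alpha=5.0):
    """Embodiment-agnostic reward: track the demonstrated object pose at step t.

    Only the object's state enters the reward, so the same signal applies whether
    the object is moved by a human hand in the video or a robot hand in simulation.
    """
    target = reference_trajectory[min(t, len(reference_trajectory) - 1)]
    return float(np.exp(-alpha * pose_distance(object_pose, target)))

def reset_episode(sim, reference_trajectory, pre_manipulation_hand_pose):
    """Start each RL episode near the demonstrated pre-manipulation state.

    `sim` is a hypothetical simulator handle; initializing the robot hand at the
    hand pose extracted from the video biases exploration toward contact
    configurations that can reproduce the demonstrated object motion.
    """
    sim.set_object_pose(reference_trajectory[0])
    sim.set_hand_pose(pre_manipulation_hand_pose)
```

In a full pipeline these pieces would feed a standard RL loop (e.g., PPO in a GPU-parallel simulator), with the reward evaluated on the simulated object state at every step; per the abstract, no task-specific reward tuning is needed beyond a tracking term of this kind.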
Related papers
- H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos [58.006918399913665]
We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos.
Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale.
At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions.
arXiv Detail & Related papers (2025-12-10T07:59:45Z) - UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations [24.232732907295194]
UniSkill is a framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels.
Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts.
arXiv Detail & Related papers (2025-05-13T17:59:22Z) - X-Sim: Cross-Embodiment Learning via Real-to-Sim-to-Real [20.561250366126625]
X-Sim is a real-to-sim-to-real framework that uses object motion as a dense and transferable signal for learning robot policies.
X-Sim starts by reconstructing a simulation from an RGB-D human video and tracking object trajectories to define object-centric rewards.
The learned policy is then distilled into an image-conditioned diffusion policy using synthetic rollouts rendered with varied viewpoints and lighting.
arXiv Detail & Related papers (2025-05-11T19:04:00Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos.
VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - XSkill: Cross Embodiment Skill Discovery [41.624343257852146]
XSkill is an imitation learning framework that discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos.
Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate skill transfer and composition for unseen tasks.
arXiv Detail & Related papers (2023-07-19T12:51:28Z) - AR2-D2: Training a Robot Without a Robot [53.10633639596096]
We introduce AR2-D2, a system for collecting demonstrations that does not require people with specialized training.
AR2-D2 is a framework in the form of an iOS app that people can use to record a video of themselves manipulating any object.
We show that data collected via our system enables the training of behavior cloning agents in manipulating real objects.
arXiv Detail & Related papers (2023-06-23T23:54:26Z) - Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z) - Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z) - Learning a Universal Human Prior for Dexterous Manipulation from Human Preference [35.54663426598218]
We propose a framework that learns a universal human prior using direct human preference feedback over videos.
A task-agnostic reward model is trained by iteratively generating diverse policies and collecting human preference over the trajectories.
Our method empirically demonstrates more human-like behaviors on robot hands across diverse tasks, including unseen ones.
arXiv Detail & Related papers (2023-04-10T14:17:33Z) - Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective; a minimal sketch of such an embedding-distance reward appears after this list.
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Human-to-Robot Imitation in the Wild [50.49660984318492]
We propose an efficient one-shot robot learning algorithm, centered around learning from a third-person perspective.
We show one-shot generalization and success in real-world settings, including 20 different manipulation tasks in the wild.
arXiv Detail & Related papers (2022-07-19T17:59:59Z) - Dexterous Imitation Made Easy: A Learning-Based Framework for Efficient Dexterous Manipulation [13.135013586592585]
'Dexterous Imitation Made Easy' (DIME) is a new imitation learning framework for dexterous manipulation.
DIME only requires a single RGB camera to observe a human operator and teleoperate our robotic hand.
On both simulation and real robot benchmarks we demonstrate that DIME can be used to solve complex, in-hand manipulation tasks.
arXiv Detail & Related papers (2022-03-24T17:58:54Z) - Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z) - Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [47.958512470724926]
Humans are adept at learning new tasks by watching a few instructional videos.
On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain.
In this paper, we explore a method that facilitates learning object manipulation skills directly from videos.
arXiv Detail & Related papers (2020-11-13T08:53:47Z)
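For the entry above on learning reward functions for robotic manipulation by observing humans, the snippet below sketches the kind of goal-distance reward it describes: the negative distance to a goal observation in a learned embedding space, with the encoder assumed to have been trained separately (e.g., with a time-contrastive objective on human videos). The encoder interface and the `scale` parameter are illustrative assumptions, not that paper's actual API.

```python
import numpy as np

def embedding_distance_reward(encoder, observation, goal_observation, scale=1.0):
    """Reward as negative distance to a goal image in a learned embedding space.

    `encoder` is assumed to map an image to a fixed-size vector; it would be
    trained beforehand, e.g., with a time-contrastive objective on videos of
    humans solving manipulation tasks.
    """
    z_obs = np.asarray(encoder(observation))
    z_goal = np.asarray(encoder(goal_observation))
    return -scale * float(np.linalg.norm(z_obs - z_goal))
```

Such a reward can be dropped into any policy-learning loop in place of a hand-designed task reward, which is what makes it task-agnostic.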