One-Shot Imitation under Mismatched Execution
- URL: http://arxiv.org/abs/2409.06615v6
- Date: Fri, 28 Mar 2025 22:16:37 GMT
- Title: One-Shot Imitation under Mismatched Execution
- Authors: Kushal Kedia, Prithwish Dan, Angela Chao, Maximus Adrian Pace, Sanjiban Choudhury,
- Abstract summary: Human demonstrations are a powerful way to program robots to do long-horizon manipulation tasks.<n> translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities.<n>We propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions.
- Score: 7.060120660671016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human demonstrations as prompts are a powerful way to program robots to do long-horizon manipulation tasks. However, translating these demonstrations into robot-executable actions presents significant challenges due to execution mismatches in movement styles and physical capabilities. Existing methods for human-robot translation either depend on paired data, which is infeasible to scale, or rely heavily on frame-level visual similarities that often break down in practice. To address these challenges, we propose RHyME, a novel framework that automatically pairs human and robot trajectories using sequence-level optimal transport cost functions. Given long-horizon robot demonstrations, RHyME synthesizes semantically equivalent human videos by retrieving and composing short-horizon human clips. This approach facilitates effective policy training without the need for paired data. RHyME successfully imitates a range of cross-embodiment demonstrators, both in simulation and with a real human hand, achieving over 50% increase in task success compared to previous methods. We release our code and datasets at https://portal-cornell.github.io/rhyme/.
Related papers
- Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration [21.94699075066712]
Teaching robots dexterous manipulation skills often requires collecting hundreds of demonstrations using wearables or teleoperation.
We propose Human2Sim2Robot, a novel real-to-sim-to-real framework for training dexterous manipulation policies.
arXiv Detail & Related papers (2025-04-17T03:15:20Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos.
VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - AnyDexGrasp: General Dexterous Grasping for Different Hands with Human-level Learning Efficiency [49.868970174484204]
We introduce an efficient approach for learning dexterous grasping with minimal data.
Our method achieves high performance with human-level learning efficiency: only hundreds of grasp attempts on 40 training objects.
This method demonstrates promising applications for humanoid robots, prosthetics, and other domains requiring robust, versatile robotic manipulation.
arXiv Detail & Related papers (2025-02-23T03:26:06Z) - Motion Tracks: A Unified Representation for Human-Robot Transfer in Few-Shot Imitation Learning [40.43176821917154]
We propose to represent actions as short-horizon 2D trajectories on an image.
These actions, or motion tracks, capture the predicted direction of motion for either human hands or robot end-effectors.
We instantiate an IL policy called Motion Track Policy (MT-pi) which receives image observations and outputs motion tracks as actions.
arXiv Detail & Related papers (2025-01-13T01:01:44Z) - DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning [42.88605563822155]
We present a large-scale automated data generation system that synthesizes trajectories from human demonstrations for humanoid robots with dexterous hands.
We generate 21K demos across these tasks from just 60 source human demos.
We also present a real-to-sim-to-real pipeline and deploy it on a real-world humanoid can sorting task.
arXiv Detail & Related papers (2024-10-31T17:48:45Z) - HRP: Human Affordances for Robotic Pre-Training [15.92416819748365]
We present a framework for pre-training representations on hand, object, and contact.
We experimentally demonstrate (using 3000+ robot trials) that this affordance pre-training scheme boosts performance by a minimum of 15% on 5 real-world tasks.
arXiv Detail & Related papers (2024-07-26T17:59:52Z) - HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation [50.616995671367704]
We present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands.
Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies.
arXiv Detail & Related papers (2024-03-15T17:45:44Z) - Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z) - SynH2R: Synthesizing Hand-Object Motions for Learning Human-to-Robot
Handovers [37.49601724575655]
Vision-based human-to-robot handover is an important and challenging task in human-robot interaction.
We introduce a framework that can generate plausible human grasping motions suitable for training the robot.
This allows us to generate synthetic training and testing data with 100x more objects than previous work.
arXiv Detail & Related papers (2023-11-09T18:57:02Z) - Giving Robots a Hand: Learning Generalizable Manipulation with
Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z) - Cross-Domain Transfer via Semantic Skill Imitation [49.83150463391275]
We propose an approach for semantic imitation, which uses demonstrations from a source domain, e.g. human videos, to accelerate reinforcement learning (RL)
Instead of imitating low-level actions like joint velocities, our approach imitates the sequence of demonstrated semantic skills like "opening the microwave" or "turning on the stove"
arXiv Detail & Related papers (2022-12-14T18:46:14Z) - Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z) - Learning Preferences for Interactive Autonomy [1.90365714903665]
This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities.
We first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, scaled comparisons; and describe how a robot can use these various forms of human feedback to infer a reward function.
arXiv Detail & Related papers (2022-10-19T21:34:51Z) - Where is my hand? Deep hand segmentation for visual self-recognition in
humanoid robots [129.46920552019247]
We propose the use of a Convolution Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.