Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards
- URL: http://arxiv.org/abs/2410.23289v1
- Date: Wed, 30 Oct 2024 17:59:41 GMT
- Title: Bridging the Human to Robot Dexterity Gap through Object-Oriented Rewards
- Authors: Irmak Guzey, Yinlong Dai, Georgy Savva, Raunaq Bhirangi, Lerrel Pinto
- Abstract summary: Training robots directly from human videos is an emerging area in robotics and computer vision.
A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand due to morphology differences.
We present HuDOR, a technique that enables online fine-tuning of policies by directly computing rewards from human videos.
- Score: 15.605887551756934
- License:
- Abstract: Training robots directly from human videos is an emerging area in robotics and computer vision. While there has been notable progress with two-fingered grippers, learning autonomous tasks for multi-fingered robot hands in this way remains challenging. A key reason for this difficulty is that a policy trained on human hands may not directly transfer to a robot hand due to morphology differences. In this work, we present HuDOR, a technique that enables online fine-tuning of policies by directly computing rewards from human videos. Importantly, this reward function is built using object-oriented trajectories derived from off-the-shelf point trackers, providing meaningful learning signals despite the morphology gap and visual differences between human and robot hands. Given a single video of a human solving a task, such as gently opening a music box, HuDOR enables our four-fingered Allegro hand to learn the task with just an hour of online interaction. Our experiments across four tasks show that HuDOR achieves a 4x improvement over baselines. Code and videos are available on our website, https://object-rewards.github.io.
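As a rough illustration of the object-oriented reward idea (a minimal sketch, not the paper's exact formulation), the snippet below scores a robot rollout by comparing tracked object-point trajectories from the human video and the robot execution, e.g. as produced by an off-the-shelf point tracker. The array shapes, the alignment scheme, and the function name are assumptions for illustration.

```python
import numpy as np

def object_trajectory_reward(human_tracks: np.ndarray,
                             robot_tracks: np.ndarray) -> float:
    """Reward a robot rollout by how closely its tracked object points
    follow the object trajectory observed in the human video.

    human_tracks, robot_tracks: arrays of shape (T, K, 2) holding the
    2D positions of K tracked object points over T frames (hypothetical
    format; an off-the-shelf point tracker would supply these tracks).
    """
    # Align trajectory lengths by simple index resampling; the actual
    # method may use a different temporal alignment.
    T = min(len(human_tracks), len(robot_tracks))
    idx_h = np.linspace(0, len(human_tracks) - 1, T).astype(int)
    idx_r = np.linspace(0, len(robot_tracks) - 1, T).astype(int)
    h, r = human_tracks[idx_h], robot_tracks[idx_r]

    # Use displacement relative to the first frame so the reward reflects
    # object motion rather than absolute image position.
    h_disp = h - h[:1]
    r_disp = r - r[:1]

    # Mean Euclidean error between corresponding object-point displacements.
    error = np.linalg.norm(h_disp - r_disp, axis=-1).mean()
    return -float(error)  # higher reward = closer match to the human video
```

Because the reward depends only on the tracked object points, it sidesteps the appearance and morphology differences between the human hand and the robot hand.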
Related papers
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We train our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real-world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Zero-Shot Robot Manipulation from Passive Human Videos [59.193076151832145]
We develop a framework for extracting agent-agnostic action representations from human videos.
Our framework is based on predicting plausible human hand trajectories.
We deploy the trained model zero-shot for physical robot manipulation tasks.
arXiv Detail & Related papers (2023-02-03T21:39:52Z)
- VideoDex: Learning Dexterity from Internet Videos [27.49510986378025]
We propose leveraging the next best thing to real-world experience: internet videos of humans using their hands.
Visual priors, such as visual features, are often learned from these videos, but videos contain richer information that can serve as a stronger prior.
We build a learning algorithm, VideoDex, that leverages visual, action, and physical priors from human video datasets to guide robot behavior.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- HERD: Continuous Human-to-Robot Evolution for Learning from Human Demonstration [57.045140028275036]
We show that manipulation skills can be transferred from a human to a robot through the use of micro-evolutionary reinforcement learning.
We propose an algorithm for multi-dimensional evolution path searching that allows joint optimization of both the robot evolution path and the policy.
arXiv Detail & Related papers (2022-12-08T15:56:13Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (a minimal sketch of this idea follows the list below).
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on Youtube [24.530131506065164]
We build a system that enables any human to control a robot hand and arm, simply by demonstrating motions with their own hand.
The robot observes the human operator via a single RGB camera and imitates their actions in real-time.
We leverage this data to train a system that understands human hands and retargets a human video stream into a robot hand-arm trajectory that is smooth, swift, safe, and semantically similar to the guiding demonstration.
arXiv Detail & Related papers (2022-02-21T18:59:59Z)
- DexVIP: Learning Dexterous Grasping with Human Hand Pose Priors from Video [86.49357517864937]
We propose DexVIP, an approach to learn dexterous robotic grasping from human-object interaction videos.
We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent's hand pose.
We demonstrate that DexVIP compares favorably to existing approaches that lack a hand pose prior or rely on specialized tele-operation equipment.
arXiv Detail & Related papers (2022-02-01T00:45:57Z)
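For the time-contrastive reward summarized in the "Learning Reward Functions for Robotic Manipulation by Observing Humans" entry above, the following is a minimal PyTorch sketch of the general idea: frames that are close in time are pulled together in an embedding space, and a robot observation is rewarded by its negative embedding distance to a goal image. The encoder architecture, triplet loss form, and function names are illustrative assumptions rather than that paper's implementation.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Small CNN mapping an RGB frame to an embedding vector
    (a stand-in for whatever visual encoder is actually used)."""
    def __init__(self, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style time-contrastive objective: frames close in time
    (anchor, positive) should embed closer together than frames far
    apart in time (anchor, negative)."""
    d_pos = (anchor - positive).pow(2).sum(-1)
    d_neg = (anchor - negative).pow(2).sum(-1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

def embedding_reward(encoder: FrameEncoder,
                     frame: torch.Tensor,
                     goal_frame: torch.Tensor) -> float:
    """Reward the current observation by its negative embedding
    distance to a goal frame."""
    with torch.no_grad():
        z = encoder(frame.unsqueeze(0))
        z_goal = encoder(goal_frame.unsqueeze(0))
    return -torch.norm(z - z_goal, dim=-1).item()
```

Once the encoder is trained on unlabeled human videos, the reward requires only a goal image, so no task-specific reward engineering or robot demonstrations are needed at policy-learning time.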
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.