Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human
Videos
- URL: http://arxiv.org/abs/2103.16817v1
- Date: Wed, 31 Mar 2021 05:25:05 GMT
- Title: Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human
Videos
- Authors: Annie S. Chen, Suraj Nair, Chelsea Finn
- Abstract summary: Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of being trained on a small amount of robot data combined with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
- Score: 59.58105314783289
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are motivated by the goal of generalist robots that can complete a wide
range of tasks across many environments. Critical to this is the robot's
ability to acquire some metric of task success or reward, which is necessary
for reinforcement learning, planning, or knowing when to ask for help. For a
general-purpose robot operating in the real world, this reward function must
also be able to generalize broadly across environments, tasks, and objects,
while depending only on on-board sensor observations (e.g. RGB images). While
deep learning on large and diverse datasets has shown promise as a path towards
such generalization in computer vision and natural language, collecting
high-quality datasets of robotic interaction at scale remains an open
challenge. In
contrast, "in-the-wild" videos of humans (e.g. YouTube) contain an extensive
collection of people doing interesting tasks across a diverse range of
settings. In this work, we propose a simple approach, Domain-agnostic Video
Discriminator (DVD), that learns multitask reward functions by training a
discriminator to classify whether two videos are performing the same task, and
can generalize by virtue of learning from a small amount of robot data with a
broad dataset of human videos. We find that by leveraging diverse human
datasets, this reward function (a) can generalize zero-shot to unseen
environments, (b) can generalize zero-shot to unseen tasks, and (c) can be
combined with visual model predictive control to solve robotic manipulation
tasks on a real WidowX200 robot in an unseen environment from a single human
demo.
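To make the idea concrete, here is a minimal, illustrative sketch of a DVD-style same-task discriminator and of how its score could serve as a planning reward. It is written in PyTorch; the encoder architecture, hyperparameters, and names such as `SameTaskDiscriminator` and `dvd_reward` are assumptions made for exposition, not the authors' released code.

```python
# Illustrative sketch of a DVD-style same-task discriminator (assumed names/architecture).
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Encodes a video tensor of shape (B, T, 3, H, W) into one embedding per clip."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, video):
        b, t, c, h, w = video.shape
        frame_feats = self.frame_cnn(video.reshape(b * t, c, h, w))
        return frame_feats.reshape(b, t, -1).mean(dim=1)  # temporal average pooling


class SameTaskDiscriminator(nn.Module):
    """Scores whether two videos (human or robot) depict the same task."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.encoder = VideoEncoder(embed_dim)  # shared encoder for both videos
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(), nn.Linear(128, 1),
        )

    def forward(self, video_a, video_b):
        z_a, z_b = self.encoder(video_a), self.encoder(video_b)
        return self.head(torch.cat([z_a, z_b], dim=-1))  # logit for "same task"


def training_step(disc, optimizer, video_a, video_b, same_task):
    """One binary cross-entropy step on a batch of video pairs.

    Pairs are sampled from a broad human-video dataset mixed with a small
    amount of robot data; `same_task` is 1.0 when both clips share a task label.
    """
    logits = disc(video_a, video_b).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, same_task)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def dvd_reward(disc, candidate_robot_video, human_demo_video):
    """Reward for planning: probability that the candidate matches the demo's task."""
    with torch.no_grad():
        return torch.sigmoid(disc(candidate_robot_video, human_demo_video)).squeeze(-1)
```

At planning time, a visual dynamics model would roll candidate action sequences forward into predicted robot videos, and `dvd_reward` computed against the single human demonstration would rank those candidates (for example inside a sampling-based MPC loop), matching the use described in point (c) of the abstract.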
Related papers
- Towards Generalizable Zero-Shot Manipulation via Translating Human
Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in
One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots.
Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations.
This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We train our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of data size, model size, and data diversity, based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (see the sketch after this list).
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
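For comparison with DVD's classification-based score, the embedding-distance reward described in the last entry above can be sketched in the same illustrative PyTorch style; the encoder, the triplet margin, and the function names are assumptions rather than that paper's actual implementation.

```python
# Illustrative sketch of a time-contrastive embedding-distance reward (assumed names).
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Maps a batch of images (B, 3, H, W) to embeddings (B, D)."""

    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames):
        return self.net(frames)


def time_contrastive_loss(encoder, anchor, positive, negative, margin=0.2):
    """Triplet loss: temporally nearby frames (anchor, positive) are pulled
    together while temporally distant frames (negative) are pushed apart."""
    z_a, z_p, z_n = encoder(anchor), encoder(positive), encoder(negative)
    d_pos = (z_a - z_p).pow(2).sum(dim=-1)
    d_neg = (z_a - z_n).pow(2).sum(dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()


def goal_distance_reward(encoder, observation, goal_image):
    """Reward = negative squared distance to the goal in the learned embedding."""
    with torch.no_grad():
        return -(encoder(observation) - encoder(goal_image)).pow(2).sum(dim=-1)
```

A higher (less negative) `goal_distance_reward` indicates that the current observation is closer to the goal image in the learned embedding space.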