Structured World Models from Human Videos
- URL: http://arxiv.org/abs/2308.10901v1
- Date: Mon, 21 Aug 2023 17:59:32 GMT
- Title: Structured World Models from Human Videos
- Authors: Russell Mendonca, Shikhar Bahl, Deepak Pathak
- Abstract summary: We tackle the problem of learning complex, general behaviors directly in the real world.
We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories.
- Score: 45.08503470821952
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of learning complex, general behaviors directly in the
real world. We propose an approach for robots to efficiently learn manipulation
skills using only a handful of real-world interaction trajectories from many
different settings. Inspired by the success of learning from large-scale
datasets in the fields of computer vision and natural language, our belief is
that in order to efficiently learn, a robot must be able to leverage
internet-scale, human video data. Humans interact with the world in many
interesting ways, which can allow a robot to not only build an understanding of
useful actions and affordances but also how these actions affect the world for
manipulation. Our approach builds a structured, human-centric action space
grounded in visual affordances learned from human videos. Further, we train a
world model on human videos and fine-tune on a small amount of robot
interaction data without any task supervision. We show that this approach of
affordance-space world models enables different robots to learn various
manipulation skills in complex settings, in under 30 minutes of interaction.
Videos can be found at https://human-world-model.github.io
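As a concrete illustration of the idea in the abstract, the snippet below sketches planning in a structured, human-centric action space with a learned world model: each action is a contact point plus a post-contact motion direction, candidate actions are imagined with the model, and the one whose predicted outcome lands closest to a goal is chosen. This is a minimal sketch under assumed names and shapes (WorldModel, plan, 64x64 images, 4-D actions), not the authors' implementation; in the paper the world model is pretrained on human videos and then fine-tuned on a small amount of robot interaction data.

```python
# Minimal sketch (not the paper's released code) of planning in a structured,
# human-centric action space with a learned world model.  Here an action is a
# 2-D contact point plus a 2-D post-contact motion direction; the encoder,
# dynamics, and goal-distance score are illustrative stand-ins.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts a future latent state from the current image and a structured action."""
    def __init__(self, latent_dim=64, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim)
        )

    def forward(self, image, action):
        z = self.encoder(image)
        return self.dynamics(torch.cat([z, action], dim=-1))

def plan(model, image, goal_latent, num_samples=256):
    """Random-shooting planner: imagine outcomes of sampled actions, keep the best."""
    actions = torch.rand(num_samples, 4) * 2 - 1               # (contact_x, contact_y, dx, dy)
    images = image.expand(num_samples, -1, -1, -1)
    with torch.no_grad():
        predicted = model(images, actions)                     # imagined future latents
        scores = -(predicted - goal_latent).pow(2).sum(-1)     # closeness to the goal latent
    return actions[scores.argmax()]

model = WorldModel()                      # in the paper: pretrained on human video,
image = torch.rand(1, 3, 64, 64)          # then fine-tuned on robot interaction data
goal_latent = torch.rand(64)
print(plan(model, image, goal_latent))    # best (contact point, direction) to execute
```

A low-dimensional, affordance-grounded action space of this kind is one plausible reading of how planning can stay tractable with only a handful of real-world robot trajectories.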
Related papers
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning generalist manipulation robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
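As a rough illustration of the plan-predictor interface described in the entry above (current image plus goal image in, future hand and object configurations out), here is a hypothetical sketch. The architecture, the 2-D hand keypoints, the box-style object outputs, and all names are assumptions for illustration, not the paper's model.

```python
# Hypothetical sketch of a human plan-predictor interface: a current image and
# a goal image go in, future hand positions and object boxes come out.  Shapes
# and architecture are illustrative assumptions only.
import torch
import torch.nn as nn

class PlanPredictor(nn.Module):
    def __init__(self, feat_dim=128, horizon=8):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, feat_dim))
        self.hand_head = nn.Linear(2 * feat_dim, horizon * 2)  # future 2-D hand positions
        self.obj_head = nn.Linear(2 * feat_dim, horizon * 4)   # future object boxes (x, y, w, h)

    def forward(self, current_image, goal_image):
        feats = torch.cat([self.encoder(current_image), self.encoder(goal_image)], dim=-1)
        hands = self.hand_head(feats).view(-1, self.horizon, 2)
        objects = self.obj_head(feats).view(-1, self.horizon, 4)
        return hands, objects   # a downstream controller would translate these into robot actions

predictor = PlanPredictor()
hands, objects = predictor(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(hands.shape, objects.shape)   # torch.Size([1, 8, 2]) torch.Size([1, 8, 4])
```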
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
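To make the affordance idea in the entry above concrete, the snippet below sketches one plausible interface for such a model: an image in, a contact-point heatmap and a post-contact motion direction out. It is a guess at the interface for illustration only, not VRB's actual architecture or released code.

```python
# Illustrative sketch (not VRB's code) of a visual affordance model that predicts
# where a human would make contact and how the contact point would then move.
import torch
import torch.nn as nn

class AffordanceModel(nn.Module):
    def __init__(self, size=64):
        super().__init__()
        self.size = size
        self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * size * size, 256), nn.ReLU())
        self.heatmap_head = nn.Linear(256, size * size)   # contact-point likelihood per pixel
        self.direction_head = nn.Linear(256, 2)           # post-contact motion direction

    def forward(self, image):
        feats = self.backbone(image)
        heatmap = self.heatmap_head(feats).softmax(-1).view(-1, self.size, self.size)
        return heatmap, self.direction_head(feats)

model = AffordanceModel()
heatmap, direction = model(torch.rand(1, 3, 64, 64))
flat_idx = heatmap.flatten(1).argmax(dim=1)               # most likely contact pixel
contact = torch.stack([torch.div(flat_idx, 64, rounding_mode="floor"), flat_idx % 64], dim=1)
print(contact, direction)                                  # a robot can target this point
```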
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, MOO, which leverages a pre-trained vision-language model to extract object-identifying information from the language command and image.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Human-to-Robot Imitation in the Wild [50.49660984318492]
We propose an efficient one-shot robot learning algorithm, centered around learning from a third-person perspective.
We show one-shot generalization and success in real-world settings, including 20 different manipulation tasks in the wild.
arXiv Detail & Related papers (2022-07-19T17:59:59Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize because it learns from a small amount of robot data combined with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
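DVD's central component, a discriminator that scores whether two videos show the same task, can be plugged into sampling-based visual MPC as a reward. The sketch below is a hypothetical illustration of that combination; the encoder, shapes, and function names are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of using a same-task video discriminator as a reward
# for ranking candidate robot rollouts against a single human demo video.
import torch
import torch.nn as nn

class SameTaskDiscriminator(nn.Module):
    """Scores how likely two videos (robot rollout, human demo) show the same task."""
    def __init__(self, frames=8, size=32, feat_dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(frames * 3 * size * size, feat_dim))
        self.classify = nn.Linear(2 * feat_dim, 1)

    def forward(self, video_a, video_b):
        feats = torch.cat([self.encode(video_a), self.encode(video_b)], dim=-1)
        return torch.sigmoid(self.classify(feats)).squeeze(-1)

def rank_rollouts(discriminator, imagined_rollouts, human_demo):
    """Pick the candidate rollout the discriminator thinks best matches the demo."""
    demo = human_demo.expand(imagined_rollouts.shape[0], -1, -1, -1, -1)
    with torch.no_grad():
        rewards = discriminator(imagined_rollouts, demo)
    return rewards.argmax()

disc = SameTaskDiscriminator()
rollouts = torch.rand(16, 8, 3, 32, 32)   # e.g. candidate futures from a visual dynamics model
demo = torch.rand(1, 8, 3, 32, 32)        # one human demonstration video
print(rank_rollouts(disc, rollouts, demo))
```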