Open-World Object Manipulation using Pre-trained Vision-Language Models
- URL: http://arxiv.org/abs/2303.00905v2
- Date: Wed, 25 Oct 2023 21:45:24 GMT
- Title: Open-World Object Manipulation using Pre-trained Vision-Language Models
- Authors: Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei
Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia,
Chelsea Finn, Karol Hausman
- Abstract summary: For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
- Score: 72.87306011500084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For robots to follow instructions from people, they must be able to connect
the rich semantic information in human vocabulary, e.g. "can you get me the
pink stuffed whale?" to their sensory observations and actions. This brings up
a notably difficult challenge for robots: while robot learning approaches allow
robots to learn many different behaviors from first-hand experience, it is
impractical for robots to have first-hand experiences that span all of this
semantic information. We would like a robot's policy to be able to perceive and
pick up the pink stuffed whale, even if it has never seen any data interacting
with a stuffed whale before. Fortunately, static data on the internet has vast
semantic information, and this information is captured in pre-trained
vision-language models. In this paper, we study whether we can interface robot
policies with these pre-trained models, with the aim of allowing robots to
complete instructions involving object categories that the robot has never seen
first-hand. We develop a simple approach, which we call Manipulation of
Open-World Objects (MOO), which leverages a pre-trained vision-language model
to extract object-identifying information from the language command and image,
and conditions the robot policy on the current image, the instruction, and the
extracted object information. In a variety of experiments on a real mobile
manipulator, we find that MOO generalizes zero-shot to a wide range of novel
object categories and environments. In addition, we show how MOO generalizes to
other, non-language-based input modalities to specify the object of interest
such as finger pointing, and how it can be further extended to enable
open-world navigation and manipulation. The project's website and evaluation
videos can be found at https://robot-moo.github.io/
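As a concrete reading of the interface described in the abstract above, the sketch below shows how a pre-trained vision-language model could localize the object named in a command and how a policy could be conditioned on the image, the instruction, and that extracted object information. This is a minimal illustrative sketch, not the authors' implementation: the detector API (`vlm.detect`), the phrase parsing, the object representation, and the policy signature are all assumptions.

```python
# Hypothetical sketch of the interface described in the abstract: a pre-trained
# vision-language model localizes the object named in the command, and the
# policy is conditioned on (image, instruction, extracted object information).
# The class and function names, the detector API, and the object representation
# are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass
import numpy as np


@dataclass
class ObjectInfo:
    """A minimal 'object-identifying' representation: the detected object's
    center in the current camera image, plus the detector's confidence."""
    center_xy: tuple
    confidence: float


def parse_object_phrase(instruction: str) -> str:
    """Placeholder noun-phrase extraction; a real system might use a language
    model or a simple grammar over the command template."""
    for prefix in ("pick up the ", "move the ", "can you get me the "):
        if instruction.lower().startswith(prefix):
            return instruction[len(prefix):].rstrip("?. ")
    return instruction


def extract_object_info(vlm, image: np.ndarray, instruction: str) -> ObjectInfo:
    """Query an open-vocabulary detector (stand-in for the pre-trained VLM)
    with the object phrase and keep the highest-scoring box."""
    phrase = parse_object_phrase(instruction)
    boxes = vlm.detect(image, text_query=phrase)  # assumed detector API
    best = max(boxes, key=lambda b: b.score)
    center = ((best.x_min + best.x_max) / 2.0, (best.y_min + best.y_max) / 2.0)
    return ObjectInfo(center_xy=center, confidence=best.score)


def moo_style_step(policy, vlm, image: np.ndarray, instruction: str):
    """One control step: condition the policy on the current image, the
    instruction, and the VLM-extracted object information."""
    obj = extract_object_info(vlm, image, instruction)
    return policy(image=image, instruction=instruction, object_info=obj)
```

In a setup like this, swapping in a different open-vocabulary detector only changes `extract_object_info`, leaving the policy's conditioning interface untouched.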
Related papers
- Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans [58.27029676638521]
We show how passive human videos can serve as a rich source of data for learning such generalist robots.
We learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations.
We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects.
arXiv Detail & Related papers (2023-12-01T18:54:12Z)
- Structured World Models from Human Videos [45.08503470821952]
We tackle the problem of learning complex, general behaviors directly in the real world.
We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories.
arXiv Detail & Related papers (2023-08-21T17:59:32Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective (a brief sketch of this reward appears after this list).
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
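For the reward-learning entry above ("Learning Reward Functions for Robotic Manipulation by Observing Humans"), the sketch below spells out the stated idea of rewards as distances to a goal in a time-contrastively learned embedding space. The encoder `embed` and the function name are assumptions for illustration, not that paper's code.

```python
# Minimal sketch of the embedding-distance reward described in the
# "Learning Reward Functions for Robotic Manipulation by Observing Humans"
# entry above. `embed` stands in for a frozen encoder trained with a
# time-contrastive objective on human videos; it is a placeholder, not that
# paper's model or code.

import numpy as np


def embedding_distance_reward(embed, observation: np.ndarray, goal_image: np.ndarray) -> float:
    """Reward is the negative L2 distance between the current observation and
    the goal image in the learned embedding space, so progress toward the goal
    increases the reward."""
    z_obs = embed(observation)
    z_goal = embed(goal_image)
    return -float(np.linalg.norm(z_obs - z_goal))
```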