Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- URL: http://arxiv.org/abs/2204.01691v1
- Date: Mon, 4 Apr 2022 17:57:11 GMT
- Title: Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
- Authors: Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes,
Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex
Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric
Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi,
Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine,
Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka
Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers,
Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu,
Sichun Xu, Mengyuan Yan
- Abstract summary: Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
- Score: 119.29555551279155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models can encode a wealth of semantic knowledge about the
world. Such knowledge could be extremely useful to robots aiming to act upon
high-level, temporally extended instructions expressed in natural language.
However, a significant weakness of language models is that they lack real-world
experience, which makes it difficult to leverage them for decision making
within a given embodiment. For example, asking a language model to describe how
to clean a spill might result in a reasonable narrative, but it may not be
applicable to a particular agent, such as a robot, that needs to perform this
task in a particular environment. We propose to provide real-world grounding by
means of pretrained skills, which are used to constrain the model to propose
natural language actions that are both feasible and contextually appropriate.
The robot can act as the language model's "hands and eyes," while the language
model supplies high-level semantic knowledge about the task. We show how
low-level skills can be combined with large language models so that the
language model provides high-level knowledge about the procedures for
performing complex and temporally-extended instructions, while value functions
associated with these skills provide the grounding necessary to connect this
knowledge to a particular physical environment. We evaluate our method on a
number of real-world robotic tasks, where we show the need for real-world
grounding and that this approach is capable of completing long-horizon,
abstract, natural language instructions on a mobile manipulator. The project's
website and the video can be found at https://say-can.github.io/
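As a rough illustration of the scheme the abstract describes (and not the authors' implementation), the sketch below scores each candidate skill by combining a language-model term with a value-function term and greedily executes the best one; `llm_log_prob`, `affordance_value`, and the skill list are hypothetical stubs introduced here for illustration.

```python
# A minimal, hypothetical sketch of the scoring loop described above -- not the
# authors' code. `llm_log_prob` stands in for a language model scoring how
# plausible each skill is as the next step, and `affordance_value` stands in
# for a skill's pretrained value function estimating its chance of success in
# the current state; both are illustrative stubs, as is the skill list.

import math

SKILLS = ["find a sponge", "pick up the sponge", "go to the spill",
          "wipe the spill", "done"]


def llm_log_prob(instruction: str, steps_so_far: list[str], skill: str) -> float:
    """Stub LLM score: prefer skills in their nominal order, never repeat one."""
    if skill in steps_so_far:
        return -10.0
    return -float(SKILLS.index(skill))


def affordance_value(state: dict, skill: str) -> float:
    """Stub value function: probability that `skill` can succeed from `state`."""
    if skill == "pick up the sponge" and not state.get("sponge_visible", False):
        return 0.01  # grasping is not feasible until a sponge has been found
    return 0.9


def plan(instruction: str, state: dict, max_steps: int = 6) -> list[str]:
    """Greedy loop: execute the skill maximizing LLM score + log affordance."""
    steps: list[str] = []
    for _ in range(max_steps):
        best = max(SKILLS, key=lambda s: llm_log_prob(instruction, steps, s)
                   + math.log(affordance_value(state, s)))
        if best == "done":
            break
        steps.append(best)
        # Executing a skill changes the world; here we fake the one relevant effect.
        if best == "find a sponge":
            state["sponge_visible"] = True
    return steps


if __name__ == "__main__":
    print(plan("clean up the spill on the counter", {"sponge_visible": False}))
    # -> ['find a sponge', 'pick up the sponge', 'go to the spill', 'wipe the spill']
```

Combining the two terms as a product (a sum in log space) mirrors the abstract's intuition that a chosen skill should be both useful for the instruction (the language-model term) and feasible in the current state (the value-function term).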
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model on its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
- Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by combining the knowledge of a language model with that of grounded models of the environment.
We frame this as a problem similar to probabilistic filtering: decode a sequence that has high probability both under the language model and under a set of grounded model objectives (a toy sketch of this decoding rule appears after this list).
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and show that the proposed decoding strategy solves complex, long-horizon robotic tasks by leveraging the knowledge of both models.
arXiv Detail & Related papers (2023-03-01T22:58:50Z)
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires annotating as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- LaTTe: Language Trajectory TransformEr [33.7939079214046]
This work proposes a flexible language-based framework to modify generic 3D robotic trajectories.
We employ an auto-regressive transformer to map natural language inputs and contextual images into changes in 3D trajectories.
We show through simulations and real-life experiments that the model can successfully follow human intent.
arXiv Detail & Related papers (2022-08-04T22:43:21Z)
- Inner Monologue: Embodied Reasoning through Planning with Language Models [81.07216635735571]
Large Language Models (LLMs) can be applied to domains beyond natural language processing.
LLMs planning in embodied environments need to consider not just which skills to perform, but also how and when to perform them.
We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
arXiv Detail & Related papers (2022-07-12T15:20:48Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- Language Conditioned Imitation Learning over Unstructured Data [9.69886122332044]
We present a method for incorporating free-form natural language conditioning into imitation learning.
Our approach learns perception from pixels, natural language understanding, and multitask continuous control end-to-end as a single neural network.
We show this dramatically improves language conditioned performance, while reducing the cost of language annotation to less than 1% of total data.
arXiv Detail & Related papers (2020-05-15T17:08:50Z)
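As a toy illustration of the probabilistic-filtering idea in the Grounded Decoding entry above (not that paper's implementation), the sketch below scores each candidate token by the product of a language-model probability and a grounded-model probability; `lm_next_token_probs` and `grounded_prob` are hypothetical stubs introduced here for illustration.

```python
# Hypothetical stubs illustrating one decoding step that combines a language
# model with a grounded model, in the spirit of the Grounded Decoding entry.

def lm_next_token_probs(prefix: list[str]) -> dict[str, float]:
    """Stub LM: it is equally happy to name any cleaning tool it knows about."""
    return {"sponge": 0.4, "towel": 0.4, "mop": 0.2}


def grounded_prob(token: str, state: dict) -> float:
    """Stub grounded model: probability that the named object is actually present."""
    return 0.95 if token in state["visible_objects"] else 0.05


def grounded_decode_step(prefix: list[str], state: dict) -> str:
    """Emit the token maximizing p_LM(token | prefix) * p_grounded(token | state)."""
    lm = lm_next_token_probs(prefix)
    return max(lm, key=lambda t: lm[t] * grounded_prob(t, state))


print(grounded_decode_step(["pick", "up", "the"], {"visible_objects": ["sponge", "spill"]}))
# -> 'sponge': the LM rates "towel" equally, but only the sponge is grounded in the scene.
```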
This list is automatically generated from the titles and abstracts of the papers on this site.