Grounded Decoding: Guiding Text Generation with Grounded Models for
Embodied Agents
- URL: http://arxiv.org/abs/2303.00855v2
- Date: Mon, 11 Dec 2023 20:58:05 GMT
- Title: Grounded Decoding: Guiding Text Generation with Grounded Models for
Embodied Agents
- Authors: Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu,
Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, Brian Ichter
- Abstract summary: The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both a language model and grounded models of the environment.
We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both models.
- Score: 111.15288256221764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in large language models (LLMs) has demonstrated the ability
to learn and leverage Internet-scale knowledge through pre-training with
autoregressive models. Unfortunately, applying such models to settings with
embodied agents, such as robots, is challenging due to their lack of experience
with the physical world, inability to parse non-language observations, and
ignorance of rewards or safety constraints that robots may require. On the
other hand, language-conditioned robotic policies that learn from interaction
data can provide the necessary grounding that allows the agent to be correctly
situated in the real world, but such policies are limited by the lack of
high-level semantic understanding due to the limited breadth of the interaction
data available for training them. Thus, if we want to make use of the semantic
knowledge in a language model while still situating it in an embodied setting,
we must construct an action sequence that is both likely according to the
language model and also realizable according to grounded models of the
environment. We frame this as a problem similar to probabilistic filtering:
decode a sequence that both has high probability under the language model and
high probability under a set of grounded model objectives. We demonstrate how
such grounded models can be obtained across three simulation and real-world
domains, and that the proposed decoding strategy is able to solve complex,
long-horizon embodiment tasks in a robotic setting by leveraging the knowledge
of both models. The project's website can be found at
grounded-decoding.github.io.
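To make the probabilistic-filtering framing concrete, below is a minimal sketch of token-level decoding that scores each candidate token by the sum of the language-model log-probability and the log-probabilities assigned by a set of grounded objectives. The interfaces `lm_token_logprobs`, `grounded_models`, and `vocab` are hypothetical placeholders, and greedy selection is used for simplicity; this is an illustration of the idea, not the authors' implementation.

```python
import math


def grounded_decode(prompt, lm_token_logprobs, grounded_models, vocab,
                    max_steps=32, eos="<eos>"):
    """Greedily decode a sequence that is likely under the language model
    and likely under every grounded model objective.

    lm_token_logprobs(prompt, tokens) -> dict mapping token -> log p_LM(token | prompt, tokens)
    grounded_models: iterable of callables, each mapping a token sequence to a
        probability in [0, 1] (e.g. an affordance or feasibility score).
    """
    tokens = []
    for _ in range(max_steps):
        lm_scores = lm_token_logprobs(prompt, tokens)
        best_token, best_score = None, -math.inf
        for token in vocab:
            # Combined objective: log p_LM + sum of log-probabilities from the
            # grounded models, floored to avoid log(0).
            score = lm_scores[token] + sum(
                math.log(max(gm(tokens + [token]), 1e-9)) for gm in grounded_models
            )
            if score > best_score:
                best_token, best_score = token, score
        tokens.append(best_token)
        if best_token == eos:
            break
    return tokens
```

Because the grounded scores enter as additional log-probability terms at every decoding step, tokens that the language model prefers but that the grounded models judge infeasible are suppressed, which is the filtering behavior described in the abstract.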
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, to follow language instructions from people, and to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
- Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show that our approach improves the interpretability and reactivity of imitation learning on 2D navigation and on simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z)
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- LaTTe: Language Trajectory TransformEr [33.7939079214046]
This work proposes a flexible language-based framework to modify generic 3D robotic trajectories.
We employ an auto-regressive transformer to map natural language inputs and contextual images into changes in 3D trajectories.
We show through simulations and real-life experiments that the model can successfully follow human intent.
arXiv Detail & Related papers (2022-08-04T22:43:21Z)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
- CAZSL: Zero-Shot Regression for Pushing Models by Generalizing Through Context [13.217582954907234]
We study the problem of designing deep learning agents which can generalize their models of the physical world by building context-aware models.
We present context-aware zero-shot learning (CAZSL, pronounced as casual) models, an approach utilizing a Siamese network, embedding space and regularization based on context variables.
We test our proposed learning algorithm on the recently released Omnipush dataset that allows testing of meta-learning capabilities.
arXiv Detail & Related papers (2020-03-26T01:21:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.