Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
- URL: http://arxiv.org/abs/2403.17124v2
- Date: Mon, 29 Apr 2024 04:34:52 GMT
- Title: Grounding Language Plans in Demonstrations Through Counterfactual Perturbations
- Authors: Yanwei Wang, Tsun-Hsuan Wang, Jiayuan Mao, Michael Hagenow, Julie Shah
- Abstract summary: Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks.
- Score: 25.19071357445557
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI. Whereas prior works have focused on leveraging LLMs directly for planning in symbolic spaces, this work uses LLMs to guide the search of task structures and constraints implicit in multi-step demonstrations. Specifically, we borrow from manipulation planning literature the concept of mode families, which group robot configurations by specific motion constraints, to serve as an abstraction layer between the high-level language representations of an LLM and the low-level physical trajectories of a robot. By replaying a few human demonstrations with synthetic perturbations, we generate coverage over the demonstrations' state space with additional successful executions as well as counterfactuals that fail the task. Our explanation-based learning framework trains an end-to-end differentiable neural network to predict successful trajectories from failures and as a by-product learns classifiers that ground low-level states and images in mode families without dense labeling. The learned grounding classifiers can further be used to translate language plans into reactive policies in the physical domain in an interpretable manner. We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks. Website: https://yanweiw.github.io/glide
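To make the perturb-and-classify recipe above concrete, here is a minimal, self-contained sketch (not the authors' code): it replays a single toy 2D demonstration with synthetic noise, labels each replay with a stand-in success check, and trains a per-state mode-family classifier end-to-end from those trajectory-level labels alone. The number of mode families, the network sizes, and the `toy_success` check are all illustrative assumptions.
```python
# Illustrative sketch only: perturb one demonstration, label each replay with a
# stand-in success check, and learn a per-state mode classifier as a by-product.
import torch
import torch.nn as nn

K = 3          # assumed number of mode families (e.g., free space, contact, goal)
STATE_DIM = 2  # toy 2D navigation states

def perturb(demo: torch.Tensor, scale: float = 0.3) -> torch.Tensor:
    """Replay the demonstration with smooth synthetic noise to cover nearby states."""
    noise = torch.cumsum(scale * torch.randn_like(demo), dim=0) * 0.05
    return demo + noise

def toy_success(traj: torch.Tensor) -> bool:
    """Stand-in for replaying in the environment: succeed if the end is near the goal (1, 1)."""
    return bool((traj[-1] - torch.tensor([1.0, 1.0])).norm() < 0.2)

class ModeGrounder(nn.Module):
    """Per-state mode classifier whose pooled output feeds a success predictor."""
    def __init__(self):
        super().__init__()
        self.mode_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, K))
        self.success_head = nn.Linear(K, 1)  # consumes the trajectory's mean mode belief

    def forward(self, traj):                          # traj: (T, STATE_DIM)
        modes = self.mode_net(traj).softmax(dim=-1)   # (T, K) soft mode assignments
        return modes, self.success_head(modes.mean(dim=0))  # pooled belief -> success logit

# Only trajectory-level success/failure labels supervise training, so the
# per-state mode classifier is learned without any dense mode labeling.
demo = torch.stack([torch.linspace(0, 1, 50), torch.linspace(0, 1, 50)], dim=1)
model = ModeGrounder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    traj = perturb(demo)
    label = torch.tensor([1.0 if toy_success(traj) else 0.0])
    _, logit = model(traj)
    loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
    opt.zero_grad(); loss.backward(); opt.step()
```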
Related papers
- From LLMs to Actions: Latent Codes as Bridges in Hierarchical Robot Control [58.72492647570062]
We introduce our method -- Learnable Latent Codes as Bridges (LCB) -- as an alternative architecture that overcomes these limitations.
We find that this method outperforms baselines that leverage pure language as the interface layer on tasks that require reasoning and multi-step behaviors.
arXiv Detail & Related papers (2024-05-08T04:14:06Z)
- Learning with Language-Guided State Abstractions [58.199148890064826]
Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations.
Our method, LGA, uses a combination of natural language supervision and background knowledge from language models to automatically build state representations tailored to unseen tasks.
Experiments on simulated robotic tasks show that LGA yields state abstractions similar to those designed by humans, but in a fraction of the time.
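As a rough sketch of this idea (not LGA's implementation), the snippet below builds a task-specific abstraction by keeping only the observation features a language model flags as relevant; the `ask_lm_relevant` stub stands in for a real LM query, and the feature names are made up.
```python
# Minimal sketch: language-guided state abstraction by feature selection.
from typing import Dict, List

def ask_lm_relevant(task: str, features: List[str]) -> List[str]:
    """Placeholder for an LM call such as 'Which features matter for: <task>?'."""
    return [f for f in features if any(word in f for word in task.lower().split())]

def abstract_state(task: str, obs: Dict[str, float]) -> Dict[str, float]:
    """Project the full observation onto the LM-selected feature subset."""
    keep = ask_lm_relevant(task, list(obs))
    return {k: obs[k] for k in keep}

obs = {"mug_x": 0.4, "mug_y": 0.1, "plate_x": 0.9, "plate_y": 0.2, "light_on": 1.0}
print(abstract_state("bring me the mug", obs))  # -> only the mug_* features remain
```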
arXiv Detail & Related papers (2024-02-28T23:57:04Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment [10.322491116030825]
DoReMi enables Detection and Recovery from Misalignments between plan and execution.
Our pipeline monitors low-level execution and enables timely recovery when a plan-execution misalignment occurs.
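A schematic sketch of the detect-and-recover loop this describes, with placeholder callbacks for execution, constraint checking, and replanning (not DoReMi's actual interfaces):
```python
# Illustrative detect-and-recover execution monitor; all callbacks are stand-ins.
from typing import Callable, List

def execute_with_monitor(
    plan: List[str],
    execute_step: Callable[[str], None],
    step_constraint_holds: Callable[[str], bool],
    replan: Callable[[str], List[str]],
    max_recoveries: int = 3,
) -> bool:
    """Run each step; if its constraint is violated, replan from that point."""
    steps, recoveries = list(plan), 0
    while steps:
        step = steps.pop(0)
        execute_step(step)
        if not step_constraint_holds(step):   # misalignment detected
            if recoveries >= max_recoveries:
                return False
            recoveries += 1
            steps = replan(step)              # recover with a fresh sub-plan
    return True

ok = execute_with_monitor(
    ["pick cup", "place cup on shelf"],
    execute_step=lambda s: print("executing:", s),
    step_constraint_holds=lambda s: True,     # pretend every step succeeds
    replan=lambda s: [],
)
print("plan completed:", ok)
```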
arXiv Detail & Related papers (2023-07-01T12:51:02Z)
- Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by combining the knowledge of a language model with that of grounded models of the environment.
We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and show that the proposed decoding strategy solves complex, long-horizon robotic tasks by leveraging the knowledge of both models.
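The filtering view above amounts to scoring each candidate continuation by the sum of its log-probability under the language model and under each grounded objective. A minimal sketch under assumed interfaces (the scorers below are placeholders, not a real API):
```python
# Pick the continuation that is likely under BOTH the language model and the
# grounded objectives; the scorer callables are assumed interfaces.
from typing import Callable, List

def grounded_decode_step(
    prefix: List[str],
    candidates: List[str],
    lm_logprob: Callable[[List[str], str], float],
    grounded_logprobs: List[Callable[[List[str], str], float]],
) -> str:
    """argmax over candidates of log p_LM(c | prefix) + sum_i log p_ground_i(c | prefix)."""
    def score(c: str) -> float:
        return lm_logprob(prefix, c) + sum(g(prefix, c) for g in grounded_logprobs)
    return max(candidates, key=score)

# Toy usage: a hand-made language-model scorer plus one affordance scorer.
lm = lambda prefix, c: {"pick up the apple": -1.0, "pick up the stove": -1.2}.get(c, -5.0)
affordance = lambda prefix, c: 0.0 if "apple" in c else -10.0  # the stove is not graspable
print(grounded_decode_step(["robot:"], ["pick up the apple", "pick up the stove"],
                           lm, [affordance]))  # -> pick up the apple
```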
arXiv Detail & Related papers (2023-03-01T22:58:50Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
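The translation step can be pictured as a nearest-neighbor lookup: each free-form step the LM proposes is mapped to the admissible action with the most similar embedding. A toy sketch with a placeholder embedding function (not the paper's actual model):
```python
# Translate a free-form LM plan step into the closest admissible action by
# embedding similarity; `embed` is a crude placeholder for a sentence encoder.
from typing import List
import numpy as np

def embed(text: str) -> np.ndarray:
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0                 # crude character histogram
    return vec / (np.linalg.norm(vec) + 1e-8)

def to_admissible(step: str, admissible_actions: List[str]) -> str:
    """Return the admissible action whose embedding is most similar to the LM's step."""
    s = embed(step)
    return max(admissible_actions, key=lambda a: float(embed(a) @ s))

actions = ["walk to the kitchen", "open the fridge", "grab the milk"]
print(to_admissible("go to kitchen", actions))    # -> walk to the kitchen
```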
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Transferable Task Execution from Pixels through Deep Planning Domain
Learning [46.88867228115775]
We propose Deep Planning Domain Learning (DPDL) to learn a hierarchical model.
DPDL learns a high-level model that predicts truth values for a set of logical predicates making up the current symbolic world state.
This allows us to perform complex, multi-step tasks even when the robot has not been explicitly trained on them.
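As a rough illustration of predicting a symbolic world state from raw observations, the sketch below maps an image to truth values for a made-up set of predicates; the architecture and predicate names are assumptions, not DPDL's.
```python
# Tiny stand-in network that turns an image into a symbolic predicate state.
import torch
import torch.nn as nn

PREDICATES = ["on(block_a, table)", "in_gripper(block_a)", "clear(block_b)"]  # made-up set

class PredicateHead(nn.Module):
    def __init__(self, n_predicates: int = len(PREDICATES)):
        super().__init__()
        self.encoder = nn.Sequential(              # tiny stand-in image encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, n_predicates)    # one logit per predicate

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(self.encoder(image)))  # predicate probabilities

model = PredicateHead()
probs = model(torch.rand(1, 3, 64, 64))            # fake RGB observation
state = {p: bool(v > 0.5) for p, v in zip(PREDICATES, probs[0])}
print(state)                                       # symbolic state a planner could act on
```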
arXiv Detail & Related papers (2020-03-08T05:51:04Z)