On Grounded Planning for Embodied Tasks with Language Models
- URL: http://arxiv.org/abs/2209.00465v3
- Date: Sat, 15 Jul 2023 10:04:08 GMT
- Title: On Grounded Planning for Embodied Tasks with Language Models
- Authors: Bill Yuchen Lin, Chengsong Huang, Qian Liu, Wenda Gu, Sam Sommerer,
Xiang Ren
- Abstract summary: Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world.
It remains unclear **whether LMs have the capacity to generate grounded, executable plans for embodied tasks.**
This is challenging because LMs lack the ability to perceive the environment through vision and to receive feedback from the physical world.
- Score: 30.217305215259277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have demonstrated their capability in possessing
commonsense knowledge of the physical world, a crucial aspect of performing
tasks in everyday life. However, it remains unclear **whether LMs have the
capacity to generate grounded, executable plans for embodied tasks.** This is
challenging because LMs lack the ability to perceive the environment through
vision and to receive feedback from the physical world. In this paper, we address
this important research question and present the first investigation into the
topic. Our novel problem formulation, named **G-PlanET**, takes as input a
high-level goal and a data table describing the objects in a specific
environment, and outputs a step-by-step actionable plan for a robotic agent to
follow. To facilitate the
study, we establish an **evaluation protocol** and design a dedicated metric to
assess the quality of the plans. Our experiments demonstrate that the use of
tables for encoding the environment and an iterative decoding strategy can
significantly enhance the LMs' ability in grounded planning. Our analysis also
reveals interesting and non-trivial findings.
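To make the task format concrete, the following is a minimal sketch of how a goal and an object table could drive iterative, step-by-step plan decoding with a language model. The table linearization, the prompt wording, and the `lm_generate` stub are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a G-PlanET-style setup: a high-level goal plus a table of
# objects is linearized into text, and plan steps are decoded one at a time so
# each step can condition on the steps generated so far.
# The linearization scheme and `lm_generate` stub are illustrative assumptions.
from typing import Callable, Dict, List


def linearize_table(objects: List[Dict[str, str]]) -> str:
    """Flatten the object table into one 'row i: col: val | col: val' line per object."""
    rows = []
    for i, obj in enumerate(objects):
        cells = " | ".join(f"{k}: {v}" for k, v in obj.items())
        rows.append(f"row {i}: {cells}")
    return "\n".join(rows)


def iterative_plan(goal: str,
                   objects: List[Dict[str, str]],
                   lm_generate: Callable[[str], str],
                   max_steps: int = 10) -> List[str]:
    """Decode plan steps one at a time, feeding previous steps back into the prompt."""
    table_text = linearize_table(objects)
    steps: List[str] = []
    for _ in range(max_steps):
        prompt = (f"Goal: {goal}\nEnvironment:\n{table_text}\n"
                  "Plan so far:\n" + "\n".join(steps) + "\nNext step:")
        step = lm_generate(prompt).strip()
        if not step or step.lower() == "done":
            break
        steps.append(step)
    return steps


if __name__ == "__main__":
    # Toy stand-in for a fine-tuned LM; a real system would call the model here.
    canned = iter(["walk to the kitchen counter", "pick up the mug", "done"])
    demo_lm = lambda prompt: next(canned, "done")
    env = [{"object": "mug", "location": "kitchen counter", "state": "clean"}]
    print(iterative_plan("bring me a mug", env, demo_lm))
```

The sketch only mirrors, at the interface level, the two ingredients the abstract highlights: a table-encoded environment and an iterative decoding strategy.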
Related papers
- Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators [34.28879194786174]
Generalizable robotic mobile manipulation in open-world environments poses significant challenges due to long horizons, complex goals, and partial observability.
A promising approach to address these challenges involves planning with a library of parameterized skills, where a task planner sequences these skills to achieve goals specified in structured languages.
This paper introduces a novel framework that leverages vision-language models to estimate uncertainty and facilitate symbolic grounding.
arXiv Detail & Related papers (2025-04-04T07:48:53Z)
- ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [39.606908488885125]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs).
It features a controllable and diverse set of embodied tasks varying in difficulty and complexity.
Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z)
- Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos [48.15438373870542]
VidAssist is an integrated framework designed for zero/few-shot goal-oriented planning in instructional videos.
It employs a breadth-first search algorithm for optimal plan generation.
Experiments demonstrate that VidAssist offers a unified framework for different goal-oriented planning setups.
arXiv Detail & Related papers (2024-09-30T17:57:28Z)
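As a rough illustration of the breadth-first search over candidate plans mentioned in the VidAssist summary above, the sketch below expands partial plans level by level and keeps the best-scoring one. The `propose_steps` and `assess_plan` callbacks are hypothetical stand-ins for the LLM-based propose and assess stages, not VidAssist's actual interface.

```python
# Rough sketch of breadth-first search over partial plans, in the spirit of a
# propose/assess/search loop. The propose and assess callbacks are hypothetical
# stand-ins for LLM-based components.
from collections import deque
from typing import Callable, List, Tuple


def bfs_plan(goal: str,
             propose_steps: Callable[[str, List[str]], List[str]],
             assess_plan: Callable[[str, List[str]], float],
             horizon: int = 4) -> Tuple[List[str], float]:
    """Expand partial plans level by level and return the best-scoring one."""
    best_plan: List[str] = []
    best_score = float("-inf")
    queue = deque([[]])  # each queue item is a partial plan (list of steps)
    while queue:
        plan = queue.popleft()
        score = assess_plan(goal, plan)
        if plan and score > best_score:
            best_plan, best_score = plan, score
        if len(plan) < horizon:
            for step in propose_steps(goal, plan):
                queue.append(plan + [step])
    return best_plan, best_score


if __name__ == "__main__":
    # Toy propose/assess functions: fixed candidate steps per depth, longer plans score higher.
    steps_by_depth = [["crack the eggs", "heat the pan"], ["whisk and pour"]]
    propose = lambda goal, plan: steps_by_depth[len(plan)] if len(plan) < len(steps_by_depth) else []
    assess = lambda goal, plan: float(len(plan))
    print(bfs_plan("make an omelette", propose, assess, horizon=2))
```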
- AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation [89.68433168477227]
Large Language Model (LLM) based agents have garnered significant attention and are becoming increasingly popular.
This paper investigates enhancing the planning abilities of LLMs through instruction tuning.
To support this, it explores the automated synthesis of diverse environments and a range of planning tasks of gradually increasing difficulty.
arXiv Detail & Related papers (2024-08-01T17:59:46Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Agent Planning with World Knowledge Model [88.4897773735576]
We introduce a parametric World Knowledge Model (WKM) to facilitate agent planning.
WKM provides prior task knowledge to guide global planning and dynamic state knowledge to assist local planning.
Our method can achieve superior performance compared to various strong baselines.
arXiv Detail & Related papers (2024-05-23T06:03:19Z)
- PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset [0.0]
We present PARADISE, an abductive reasoning task in Q&A format over practical procedural text sourced from wikiHow.
It involves warning and tip inference tasks that are directly associated with goals and exclude intermediary steps, with the aim of testing the models' ability to infer implicit knowledge of the plan solely from the given goal.
Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios.
arXiv Detail & Related papers (2024-03-05T18:01:59Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints.
During inference, we discover the objects in the scene by applying open-vocabulary object detectors to multi-view RGB images collected at different reachable locations.
Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than those from LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
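A minimal sketch of the grounding idea described above: detections from several viewpoints are merged into a single object list that constrains the planning prompt. The `detect_objects` stub and the prompt wording are illustrative placeholders, not TaPA's actual pipeline.

```python
# Minimal sketch: merge open-vocabulary detections from several viewpoints and
# use the resulting object list to constrain a planning prompt.
# `detect_objects` and the prompt wording are illustrative placeholders.
from typing import Callable, Iterable, List, Set


def collect_scene_objects(views: Iterable[str],
                          detect_objects: Callable[[str], List[str]]) -> Set[str]:
    """Union the object names detected in each RGB view of the scene."""
    found: Set[str] = set()
    for view in views:
        found.update(detect_objects(view))
    return found


def build_plan_prompt(instruction: str, scene_objects: Set[str]) -> str:
    """Ask the LLM for steps that only use objects actually present in the scene."""
    objects = ", ".join(sorted(scene_objects))
    return (f"Objects in the scene: {objects}\n"
            f"Instruction: {instruction}\n"
            "Write a step-by-step plan using only the listed objects:")


if __name__ == "__main__":
    # Toy detector: pretend each image path maps to a fixed set of detections.
    fake_detections = {"view_0.png": ["table", "mug"], "view_1.png": ["mug", "kettle"]}
    detector = lambda path: fake_detections.get(path, [])
    objs = collect_scene_objects(["view_0.png", "view_1.png"], detector)
    print(build_plan_prompt("make a cup of tea", objs))
```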
- Learning to Reason over Scene Graphs: A Case Study of Finetuning GPT-2 into a Robot Language Model for Grounded Task Planning [45.51792981370957]
We investigate the applicability of a smaller class of large language models (LLMs) in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially.
Our method grounds the input of the LLM on the domain that is represented as a scene graph, enabling it to translate human requests into executable robot plans.
Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating the promising potential for the future application of neuro-symbolic planning methods in robotics.
arXiv Detail & Related papers (2023-05-12T18:14:32Z)
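To give a flavor of grounding an LM on a scene graph, the sketch below serializes (subject, relation, object) triples into text that can precede a task request. The triple format and the prompt are assumptions for illustration, not the paper's exact representation.

```python
# Minimal sketch: serialize a scene graph into text so an LM prompt can be
# grounded on the objects and relations actually present in the scene.
# The (subject, relation, object) triple format is an illustrative assumption.
from typing import List, Tuple

SceneGraph = List[Tuple[str, str, str]]  # (subject, relation, object) triples


def serialize_scene_graph(graph: SceneGraph) -> str:
    """Flatten triples into one line per relation, e.g. 'cup -- on --> table'."""
    return "\n".join(f"{s} -- {r} --> {o}" for s, r, o in graph)


def build_grounded_prompt(request: str, graph: SceneGraph) -> str:
    """Prepend the serialized scene graph to a human request for subgoal decomposition."""
    return (f"Scene graph:\n{serialize_scene_graph(graph)}\n"
            f"Request: {request}\n"
            "Decompose the request into subgoals the robot planner can execute:")


if __name__ == "__main__":
    graph = [("cup", "on", "table"), ("table", "next_to", "sink"), ("robot", "near", "door")]
    print(build_grounded_prompt("put the cup in the sink", graph))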
This list is automatically generated from the titles and abstracts of the papers on this site.