A Picture is Worth a Thousand Words: Language Models Plan from Pixels
- URL: http://arxiv.org/abs/2303.09031v1
- Date: Thu, 16 Mar 2023 02:02:18 GMT
- Title: A Picture is Worth a Thousand Words: Language Models Plan from Pixels
- Authors: Anthony Z. Liu, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee
- Abstract summary: Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
- Score: 53.85753597586226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Planning is an important capability of artificial agents that perform
long-horizon tasks in real-world environments. In this work, we explore the use
of pre-trained language models (PLMs) to reason about plan sequences from text
instructions in embodied visual environments. Prior PLM-based approaches for
planning either assume observations are available in the form of text (e.g.,
provided by a captioning model), reason about plans from the instruction alone,
or incorporate information about the visual environment in limited ways (such
as a pre-trained affordance function). In contrast, we show that PLMs can
accurately plan even when observations are directly encoded as input prompts
for the PLM. We show that this simple approach outperforms prior approaches in
experiments on the ALFWorld and VirtualHome benchmarks.
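As a rough illustration of what "observations directly encoded as input prompts" could look like, the sketch below projects visual feature vectors into a language model's embedding space and prepends them to the instruction's token embeddings. This is only one plausible reading of the abstract, not the authors' implementation: the function name, the linear projection, and all dimensions (512 and 768) are assumptions, and random arrays stand in for real image features and learned weights.

```python
import numpy as np

def encode_observation_as_prompt(obs_features, projection, token_embeddings):
    """Hypothetical helper: map visual features into the LM embedding space
    and prepend them to the instruction's token embeddings."""
    # obs_features: (num_obs, feat_dim); projection: (feat_dim, embed_dim)
    visual_prompt = obs_features @ projection  # (num_obs, embed_dim)
    # token_embeddings: (num_tokens, embed_dim)
    return np.concatenate([visual_prompt, token_embeddings], axis=0)

rng = np.random.default_rng(0)
obs = rng.normal(size=(4, 512))             # 4 observation feature vectors (assumed size)
proj = rng.normal(size=(512, 768)) * 0.02   # stands in for a learned projection
text = rng.normal(size=(10, 768))           # embeddings of a 10-token instruction (assumed)

prompt = encode_observation_as_prompt(obs, proj, text)
print(prompt.shape)  # (14, 768)
```

In a setup like this, the combined sequence would be fed to the language model in place of ordinary token embeddings, so the model conditions on the observation without an intermediate captioning step.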
Related papers
- ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
One critical issue is bridging the gap between discrete entities in low-level observations and the abstract concepts required for effective planning.
We propose visual-temporal context, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts [34.636688162807836]
Large Language Models (LLMs) have shown promise in solving natural language-described planning tasks, but their direct use often leads to inconsistent reasoning and hallucination.
We propose a novel approach that constructs an action schema library to generate multiple candidates, accounting for the diverse possible interpretations of natural language descriptions.
Experiments showed our pipeline maintains superiority in planning over the direct LLM planning approach.
arXiv Detail & Related papers (2024-09-24T09:33:12Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting without task name as supervision, which is not currently solvable by existing large language models.
arXiv Detail & Related papers (2024-02-23T19:34:47Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that plans generated by our TaPA framework achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation that works across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.