A Picture is Worth a Thousand Words: Language Models Plan from Pixels
- URL: http://arxiv.org/abs/2303.09031v1
- Date: Thu, 16 Mar 2023 02:02:18 GMT
- Title: A Picture is Worth a Thousand Words: Language Models Plan from Pixels
- Authors: Anthony Z. Liu, Lajanugen Logeswaran, Sungryull Sohn, Honglak Lee
- Abstract summary: Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments.
In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments.
- Score: 53.85753597586226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Planning is an important capability of artificial agents that perform
long-horizon tasks in real-world environments. In this work, we explore the use
of pre-trained language models (PLMs) to reason about plan sequences from text
instructions in embodied visual environments. Prior PLM-based approaches for
planning either assume observations are available in the form of text (e.g.,
provided by a captioning model), reason about plans from the instruction alone,
or incorporate information about the visual environment in limited ways (such
as a pre-trained affordance function). In contrast, we show that PLMs can
accurately plan even when observations are directly encoded as input prompts
for the PLM. We show that this simple approach outperforms prior approaches in
experiments on the ALFWorld and VirtualHome benchmarks.
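The core idea of the abstract, encoding visual observations directly as input to the PLM rather than via a captioning model, can be illustrated with a minimal sketch. This is an illustrative assumption of how such an encoding might look, in plain Python with toy dimensions; the projection weights, dimensions, and function names are hypothetical, not the paper's actual architecture.

```python
# Minimal sketch: observation features are linearly projected into the
# PLM's token-embedding space and prepended to the instruction token
# embeddings, so the model "sees" the observation directly in its prompt.
# All weights and dimensions below are toy, illustrative values.

def project(features, weights):
    """Project an observation feature vector (length d_in) into the PLM
    embedding dimension (length d_out) using a d_out x d_in matrix."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def build_input_embeddings(obs_features, proj_weights, instruction_embeds):
    """Prepend the projected observation embedding to the instruction
    token embeddings, forming the sequence fed to the PLM."""
    obs_embed = project(obs_features, proj_weights)
    return [obs_embed] + instruction_embeds

# Toy setup: 3-dim observation features, 2-dim PLM embeddings.
obs = [1.0, 0.0, 2.0]
W = [[0.5, 0.0, 0.5],   # hypothetical learned projection matrix
     [0.0, 1.0, 0.0]]
instr = [[0.1, 0.2], [0.3, 0.4]]  # two instruction token embeddings

seq = build_input_embeddings(obs, W, instr)
print(len(seq), seq[0])  # 3 input positions; projected obs = [1.5, 0.0]
```

In a real system the projection would be a learned layer and the sequence would be passed to the PLM as precomputed input embeddings instead of token IDs; the sketch only shows the shape of the interface.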
Related papers
- ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning.
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- Planning in the Dark: LLM-Symbolic Planning Pipeline without Experts [34.636688162807836]
Large Language Models (LLMs) have shown promise in solving natural language-described planning tasks, but their direct use often leads to inconsistent reasoning and hallucination.
We propose a novel approach that constructs an action schema library to generate multiple candidates, accounting for the diverse possible interpretations of natural language descriptions.
Experiments showed our pipeline maintains superiority in planning over the direct LLM planning approach.
arXiv Detail & Related papers (2024-09-24T09:33:12Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - CI w/o TN: Context Injection without Task Name for Procedure Planning [4.004155037293416]
Procedure planning in instructional videos involves creating goal-directed plans based on visual start and goal observations from videos.
Previous research has tackled this problem with gradually weaker training supervision, from heavy intermediate visual observations or language instructions to task class supervision.
We propose a much weaker setting without task name as supervision, which is not currently solvable by existing large language models.
arXiv Detail & Related papers (2024-02-23T19:34:47Z) - Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) for grounded planning with physical scene constraints in embodied tasks.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z) - ProgPrompt: Generating Situated Robot Task Plans using Large Language
Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation that functions across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z) - Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z) - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge
for Embodied Agents [111.33545170562337]
We investigate the possibility of grounding high-level tasks, expressed in natural language, to a chosen set of actionable steps.
We find that if pre-trained LMs are large enough and prompted appropriately, they can effectively decompose high-level tasks into low-level plans.
We propose a procedure that conditions on existing demonstrations and semantically translates the plans to admissible actions.
arXiv Detail & Related papers (2022-01-18T18:59:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.