Vision Language Models Cannot Plan, but Can They Formalize?
- URL: http://arxiv.org/abs/2509.21576v1
- Date: Thu, 25 Sep 2025 20:55:11 GMT
- Title: Vision Language Models Cannot Plan, but Can They Formalize?
- Authors: Muyu He, Yuxi Zheng, Yuchen Liu, Zijian An, Bill Cai, Jiani Huang, Lifeng Zhou, Feng Liu, Ziyang Li, Li Zhang,
- Abstract summary: We present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization.<n>We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations.
- Score: 28.52711774279781
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensate for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
Related papers
- MATEO: A Multimodal Benchmark for Temporal Reasoning and Planning in LVLMs [2.1793134762413433]
AI agents need to plan to achieve complex goals that involve orchestrating perception, sub-goal decomposition, and execution.<n>Existing research on foundational models' understanding of temporal execution is limited to automatically derived annotations, approximations of the TEO as a linear chain, or text-only inputs.<n>We introduce MATEO, a benchmark designed to assess and improve the temporal reasoning abilities of Large Vision Language Models (LVLMs) required for real-world planning.
arXiv Detail & Related papers (2026-02-16T09:41:50Z) - Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models [53.06230963851451]
JARVIS is a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.<n>We introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs.
arXiv Detail & Related papers (2025-12-17T19:01:34Z) - Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs)<n>This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z) - Unifying Inference-Time Planning Language Generation [27.998081619086477]
A line of work in planning uses LLM not to generate a plan, but to generate a formal representation in some planning language.<n>We propose a unifying framework based on intermediate representations.<n>We provide recipes for planning language generation pipelines, draw a series of conclusions showing the efficacy of their various components.
arXiv Detail & Related papers (2025-05-20T17:25:23Z) - On the Limit of Language Models as Planning Formalizers [4.145422873316857]
Large Language Models have been found to create plans that are neither executable nor verifiable in grounded environments.<n>An emerging line of work demonstrates success in using the LLM as a formalizer to generate a formal representation of the planning domain in some language.<n>This formal representation can be deterministically solved to find a plan.
arXiv Detail & Related papers (2024-12-13T05:50:22Z) - Show and Guide: Instructional-Plan Grounded Vision and Language Model [9.84151565227816]
We present MM-PlanLLM, the first multimodal plan-following language model.
We bring cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
arXiv Detail & Related papers (2024-09-27T18:20:24Z) - LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z) - PDDLEGO: Iterative Planning in Textual Environments [56.12148805913657]
Planning in textual environments has been shown to be a long-standing challenge even for current models.
We propose PDDLEGO that iteratively construct a planning representation that can lead to a partial plan for a given sub-goal.
We show that plans produced by few-shot PDDLEGO are 43% more efficient than generating plans end-to-end on the Coin Collector simulation.
arXiv Detail & Related papers (2024-05-30T08:01:20Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z) - ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon
Sequential Task Planning [7.701407633867452]
Large Language Models (LLMs) offer the potential to enhance the generalizability as task-agnostic planners.
We introduce ISR-LLM, a novel framework that improves LLM-based planning through an iterative self-refinement process.
We show that ISR-LLM is able to achieve markedly higher success rates in task accomplishments compared to state-of-the-art LLM-based planners.
arXiv Detail & Related papers (2023-08-26T01:31:35Z) - AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated the potential in acting as autonomous agents for sequential decision-making tasks.
We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback.
To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
arXiv Detail & Related papers (2023-05-26T05:52:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.