LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning
- URL: http://arxiv.org/abs/2507.08496v1
- Date: Fri, 11 Jul 2025 11:18:49 GMT
- Title: LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning
- Authors: Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan
- Abstract summary: We introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images. We enhance LLaPa with two auxiliary modules to improve procedural planning.
- Score: 26.098281158573748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model's reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.
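To make the abstract's two-module design concrete, below is a minimal, hypothetical Python sketch of how TER and CAR might feed a planner: TER reranks segmented image regions by task relevance, and CAR surfaces counterfactual conditions for the prompt. Every name here (`Region`, `task_environment_rerank`, `retrieve_counterfactuals`, `plan_actions`) and all of the stub logic are illustrative assumptions, not the authors' API; the real implementation lives in the linked repository.

```python
# Hypothetical sketch of the LLaPa pipeline described in the abstract.
# Names and logic are illustrative stubs, not the authors' API
# (see https://github.com/sunshibo1234/LLaPa for the real code).
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    label: str        # segmented object, e.g. "microwave"
    relevance: float  # task-conditioned score in [0, 1]


def task_environment_rerank(task: str, regions: List[Region]) -> List[Region]:
    """TER (sketch): rank segmented regions so the planner attends to
    the regions most critical for executing this task."""
    return sorted(regions, key=lambda r: r.relevance, reverse=True)


def retrieve_counterfactuals(task: str, memory: List[str]) -> List[str]:
    """CAR (sketch): surface stored conditions that could invalidate the
    default plan, using naive keyword overlap as a stand-in for retrieval."""
    words = task.lower().split()
    return [c for c in memory if any(w in c for w in words)]


def plan_actions(task: str, regions: List[Region], memory: List[str]) -> List[str]:
    """Compose planner input from TER and CAR outputs. A real system would
    prompt a VLM here; this stub just echoes the assembled context."""
    focus = task_environment_rerank(task, regions)[:3]
    counterfactuals = retrieve_counterfactuals(task, memory)
    return ([f"attend to {r.label}" for r in focus]
            + [f"handle counterfactual: {c}" for c in counterfactuals])


if __name__ == "__main__":
    regions = [Region("microwave", 0.9), Region("sink", 0.2), Region("mug", 0.8)]
    memory = ["microwave is already occupied", "faucet is broken"]
    print(plan_actions("heat the mug in the microwave", regions, memory))
```

Running the stub prints a trivial "plan" that attends to the top-ranked regions and flags the retrieved counterfactual, which is the shape of context a VLM planner would consume under this reading of the abstract.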
Related papers
- Language-Vision Planner and Executor for Text-to-Visual Reasoning [9.140712714337273]
Inspired by recent developments in large language models (LLMs) for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time.
arXiv Detail & Related papers (2025-06-09T13:55:55Z) - Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models [63.765846080050906]
This paper proposes a novel parameter-efficient action planner using large language models (PEAP-LLM) to generate a single-step instruction at each location. Experiments show the superiority of our proposed model on REVERIE compared to the previous state-of-the-art.
arXiv Detail & Related papers (2025-05-12T12:38:20Z) - PlanLLM: Video Procedure Planning with Refinable Large Language Models [5.371855090716962]
Video procedure planning, i.e., planning a sequence of action steps given the video frames of start and goal states, is an essential ability for embodied AI. Recent works utilize Large Language Models (LLMs) to generate enriched action step description texts to guide action step decoding. We propose PlanLLM, a cross-modal joint learning framework with LLMs for video procedure planning.
arXiv Detail & Related papers (2024-12-26T09:51:05Z) - Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z) - Show and Guide: Instructional-Plan Grounded Vision and Language Model [9.84151565227816]
We present MM-PlanLLM, the first multimodal plan-following language model.
We bring cross-modality through two key tasks: Conversational Video Moment Retrieval and Visually-Informed Step Generation.
MM-PlanLLM is trained using a novel multitask-multistage approach.
arXiv Detail & Related papers (2024-09-27T18:20:24Z) - ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning [27.725814615823687]
We propose a "plug-and-play" method, ExoViP, to correct errors in both the planning and execution stages.
We employ verification modules as "exoskeletons" to enhance current vision-language programming schemes.
arXiv Detail & Related papers (2024-08-05T03:22:10Z) - VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability of these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states with respect to history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.