World-aware Planning Narratives Enhance Large Vision-Language Model Planner
- URL: http://arxiv.org/abs/2506.21230v2
- Date: Wed, 02 Jul 2025 15:03:26 GMT
- Title: World-aware Planning Narratives Enhance Large Vision-Language Model Planner
- Authors: Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, Xipeng Qiu
- Abstract summary: Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios. We propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding.
- Score: 48.97399087613431
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) show promise for embodied planning tasks but struggle with complex scenarios involving unfamiliar environments and multi-step goals. Current approaches rely on environment-agnostic imitation learning that disconnects instructions from environmental contexts, causing models to struggle with context-sensitive instructions and rely on supplementary cues rather than visual reasoning during long-horizon interactions. In this work, we propose World-Aware Planning Narrative Enhancement (WAP), a framework that infuses LVLMs with comprehensive environmental understanding through four cognitive capabilities (visual appearance modeling, spatial reasoning, functional abstraction, and syntactic grounding) while developing and evaluating models using only raw visual observations through curriculum learning. Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5-VL achieving a 60.7 absolute improvement in task success rates, particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). Notably, our enhanced open-source models outperform proprietary systems like GPT-4o and Claude-3.5-Sonnet by a large margin.
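The abstract names four cognitive capabilities and a curriculum-learning pipeline but gives no implementation details. As a rough illustration only, the sketch below shows one plausible way such narrative enhancement could be organized: each raw instruction/observation pair is expanded into enrichment queries along the four capability axes listed above. All identifiers, prompt wording, and example data are assumptions made for this sketch, not the authors' code.

```python
# Illustrative sketch only: a hypothetical WAP-style enrichment step that turns a raw
# instruction + observation caption into capability-specific narrative queries.
# None of these names or prompts come from the paper; they are assumptions.
from dataclasses import dataclass

# The four cognitive capabilities named in the abstract.
CAPABILITIES = {
    "visual_appearance": "Describe the colors, shapes, and materials of the task-relevant objects.",
    "spatial_reasoning": "Describe where the target objects are, relative to the agent and to each other.",
    "functional_abstraction": "Explain what each relevant object can be used for in this task.",
    "syntactic_grounding": "Restate the instruction so every referring expression maps to a visible object.",
}

@dataclass
class PlanningSample:
    instruction: str          # original environment-agnostic instruction
    observation_caption: str  # caption of the raw visual observation

def build_enrichment_prompts(sample: PlanningSample) -> dict:
    """Build one narrative-enrichment query per capability.

    In a full pipeline these queries would be answered (e.g., by an LVLM) and the
    answers folded back into the training instruction; here we only construct them.
    """
    return {
        name: (
            f"Instruction: {sample.instruction}\n"
            f"Observation: {sample.observation_caption}\n"
            f"Request: {request}"
        )
        for name, request in CAPABILITIES.items()
    }

if __name__ == "__main__":
    sample = PlanningSample(
        instruction="Put the chilled apple on the dining table.",
        observation_caption="A kitchen with a fridge, a counter holding an apple, and a dining table.",
    )
    for name, prompt in build_enrichment_prompts(sample).items():
        print(f"--- {name} ---\n{prompt}\n")
```

A curriculum, as described in the abstract, would then order training samples from simple appearance-level narratives toward long-horizon, multi-capability ones; that scheduling is omitted from the sketch.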
Related papers
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal awareness of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - EmbRACE-3K: Embodied Reasoning and Action in Complex Environments [48.32142591866083]
EmbRACE-3K is a dataset of over 3,000 language-guided tasks constructed using Unreal Engine and the UnrealCV-Zoo framework. We establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments.
arXiv Detail & Related papers (2025-07-14T17:59:46Z) - EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving [61.99289768925256]
EvaLearn is a benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks. We benchmark nine frontier models and observe varied performance profiles. We observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks.
arXiv Detail & Related papers (2025-06-03T09:18:33Z) - CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification [0.20482269513546458]
This paper presents our approach to SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data.
arXiv Detail & Related papers (2025-05-29T15:19:00Z) - VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility [31.50489359729733]
VISO-Grasp is a vision-language-informed system designed to address visibility constraints for grasping in severely occluded environments. VISO-Grasp achieves a success rate of 87.5% in target-oriented grasping with the fewest grasp attempts, outperforming baselines.
arXiv Detail & Related papers (2025-03-16T18:46:54Z) - LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds.
Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines.
We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states with respect to history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.