MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
- URL: http://arxiv.org/abs/2507.23382v1
- Date: Thu, 31 Jul 2025 09:59:17 GMT
- Title: MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
- Authors: Yiyan Ji, Haoran Chen, Qiguang Chen, Chengyue Wu, Libo Qin, Wanxiang Che,
- Abstract summary: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context. Current benchmarks face two key challenges: (1) they cannot directly assess real-world multimodal planning capabilities, and (2) they lack explicit or implicit constraints that span modalities. We introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning.
- Score: 42.30936364450115
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess real-world multimodal planning capabilities, and (2) they lack explicit or implicit constraints that span modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To address the second, we introduce complex constraints (e.g., budget, temporal, and spatial constraints) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) that separate constraint complexity from search-space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. We also observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advances in constraint-aware reasoning for real-world MLLM applications.
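The feasibility criterion the abstract describes — a plan counts only if it satisfies every constraint — can be sketched as follows. This is an illustrative toy, not MPCC's actual data format or evaluation harness; the constraint names and thresholds are invented for the example.

```python
# Illustrative sketch (not MPCC's actual format): a planning instance pairs
# a task with explicit constraints, and a plan is "feasible" only when it
# violates none of them -- the feasible-plan rate is what the benchmark reports.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Constraint:
    name: str                        # e.g. "budget", "temporal", "spatial"
    check: Callable[[dict], bool]    # predicate over a candidate plan

def violated(plan: dict, constraints: List[Constraint]) -> List[str]:
    """Return the names of all violated constraints (empty list = feasible)."""
    return [c.name for c in constraints if not c.check(plan)]

# A toy flight-planning instance with a budget and a temporal constraint.
constraints = [
    Constraint("budget", lambda p: p["total_cost"] <= 500),
    Constraint("temporal", lambda p: p["arrival_hour"] <= 18),
]

plan = {"total_cost": 430, "arrival_hour": 20}
print(violated(plan, constraints))  # ['temporal']
```

Grading difficulty then amounts to adding more (or more interacting) constraints to the list while holding the search space fixed, which is the separation the benchmark's EASY/MEDIUM/HARD levels aim for.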
Related papers
- MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning [10.602434753538535]
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. Here, we present MARBLE, a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube.
arXiv Detail & Related papers (2025-06-28T19:44:32Z)
- EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance LLMs' ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z)
- Decompose, Plan in Parallel, and Merge: A Novel Paradigm for Large Language Models based Planning with Multiple Constraints [31.631832677979826]
We propose a novel parallel planning paradigm, which Decomposes the task, Plans for subtasks in Parallel, and Merges subplans into a final plan (DPPM). Specifically, DPPM decomposes the complex task into subtasks based on its constraints, generates a subplan for each subtask in parallel, and merges them into a global plan. Experimental results demonstrate that DPPM significantly outperforms existing methods on travel planning tasks.
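The decompose / plan-in-parallel / merge control flow described above can be sketched as below. The decomposer, per-subtask planner, and merger here are stand-in stubs (the paper's versions are LLM-based); only the orchestration pattern is illustrated.

```python
# Minimal sketch of the DPPM-style pattern: split a task into subtasks,
# plan each subtask concurrently, then merge subplans into one global plan.
# All three stages are stubs standing in for LLM calls.
from concurrent.futures import ThreadPoolExecutor
from typing import List

def decompose(task: str) -> List[str]:
    # Stand-in: split a task into constraint-scoped subtasks.
    return [f"{task}/day-{i}" for i in range(1, 4)]

def plan_subtask(subtask: str) -> List[str]:
    # Stand-in for a planner call that produces steps for one subtask.
    return [f"book({subtask})", f"schedule({subtask})"]

def merge(subplans: List[List[str]]) -> List[str]:
    # Concatenate in subtask order; a real merger would resolve conflicts
    # between subplans (e.g. overlapping times or shared budgets).
    return [step for plan in subplans for step in plan]

def dppm(task: str) -> List[str]:
    subtasks = decompose(task)
    with ThreadPoolExecutor() as pool:
        subplans = list(pool.map(plan_subtask, subtasks))  # order-preserving
    return merge(subplans)

print(len(dppm("trip")))  # 6 steps from 3 subtasks
```

`ThreadPoolExecutor.map` preserves input order, so the merged plan stays aligned with the subtask decomposition even though planning runs concurrently.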
arXiv Detail & Related papers (2025-06-03T09:33:13Z)
- RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning [60.84707424369494]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks. We introduce the Reasoning Boundary Framework++ (RBF++), a framework for evaluating and optimizing measurable boundaries of CoT capability.
arXiv Detail & Related papers (2025-05-19T16:25:55Z)
- HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.09735490692202]
We propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro and a 3.6x performance improvement over o1-preview.
arXiv Detail & Related papers (2025-05-05T02:38:58Z)
- MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning [2.5200794639628032]
Multi-Agent Collaborative Intelligence (MACI) is a framework comprising three key components: (1) a meta-planner (MP) that identifies, formulates, and refines all roles and constraints of a task while generating a dependency graph, with common-sense augmentation to ensure realistic and practical constraints; (2) a collection of agents that facilitate planning and address task-specific requirements; and (3) a run-time monitor that manages plan adjustments as needed.
arXiv Detail & Related papers (2025-01-28T03:57:22Z)
- EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z)
- Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming [13.246017517159043]
Large language models (LLMs) have recently demonstrated strong potential in solving planning problems. We propose LLMFP, a framework that leverages LLMs to capture key information from planning problems and to formally formulate and solve them as optimization problems from scratch. We apply LLMFP to 9 planning problems, ranging from multi-constraint decision making to multi-step planning, and demonstrate that it achieves average optimal rates of 83.7% and 86.8% across the 9 tasks for GPT-4o and Claude 3.5 Sonnet, respectively.
arXiv Detail & Related papers (2024-10-15T23:20:54Z)
- Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows code-form plans: pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
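The "code-form plan" idea from the CodePlan entry above — pseudocode that fixes the control flow of the reasoning before any step is worked out — can be illustrated as follows. The plan text and the task (a budget check) are invented for the example; this is not CodePlan's actual prompt or training format.

```python
# Illustrative sketch of a code-form plan: the plan is pseudocode whose
# control flow (loop, conditional) structures the reasoning; a solver then
# follows that structure step by step. Hand-written example, not CodePlan's format.
code_form_plan = """
for item in receipts:
    total = add(total, item)
if total > budget:
    return "overspend"
return "ok"
"""

def follow_plan(receipts, budget):
    # Direct execution of the plan's semantics: accumulate, then branch.
    total = 0
    for item in receipts:
        total += item
    return "overspend" if total > budget else "ok"

print(follow_plan([120, 80, 310], 400))  # overspend
```

The point of the code form is that the loop and the conditional are committed to up front, so each concrete reasoning step only has to fill in one line of the outline rather than re-derive the overall structure.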
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.