Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
- URL: http://arxiv.org/abs/2512.03955v1
- Date: Wed, 03 Dec 2025 16:49:14 GMT
- Title: Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
- Authors: Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube, Felix Gehlhoff,
- Abstract summary: We introduce a benchmark with an executable simulation environment representing the Blocksworld problem.<n>By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark.
- Score: 0.19544534628180865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
Related papers
- DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework.<n>By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation.<n>We develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
arXiv Detail & Related papers (2026-03-01T16:10:16Z) - The Auton Agentic AI Framework [5.410458076724158]
The field of Artificial Intelligence is undergoing a transition from Generative AI to Agentic AI.<n>This transition exposes a fundamental architectural mismatch: Large Language Models (LLMs) produce unstructured outputs, whereas the backend infrastructure they must control requires deterministic, schema-conformant inputs.<n>The present paper describes the Auton Agentic AI Framework, a principled architecture for the creation, creation, and governance of autonomous agent.
arXiv Detail & Related papers (2026-02-27T06:42:08Z) - NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents [41.70615840873279]
We present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimization implementations.<n>NEMO centers on remote interaction with autonomous coding agents (ACAs), treated as a first-class abstraction analogous to API-based interaction with LLMs.<n>Because ACAs execute within sandboxed environments, code produced by NEMO is executable by construction, allowing automated validation and repair.
arXiv Detail & Related papers (2026-01-29T07:57:23Z) - HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search [4.807104001943257]
Hierarchical Simulation Agent (HEAS) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation.<n>HEAS represents models as hierarchies of lightweight processes ("streams") scheduled in deterministic layers that read and write a shared context.<n> compact API and CLI-simulate, optimize, evaluate-expose single- and multi-objective evolution.
arXiv Detail & Related papers (2025-08-21T13:35:46Z) - MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers.<n>Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching.<n>We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z) - MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models [76.72220653705679]
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents.<n> MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines.<n> Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z) - Adaptive Domain Modeling with Language Models: A Multi-Agent Approach to Task Planning [5.638621244710438]
TAPAS employs specialized LLM-based agents that collaboratively generate and adapt domain models.<n>A ReAct (Reason+Act)-style execution agent, coupled with natural language plan translation, bridges the gap between dynamically generated plans and real-world robot capabilities.
arXiv Detail & Related papers (2025-06-24T13:02:06Z) - BLADE: Benchmark suite for LLM-driven Automated Design and Evolution of iterative optimisation heuristics [2.2485774453793037]
BLADE is a framework for benchmarking LLM-driven AAD methods in a continuous black-box optimisation context.<n>It integrates benchmark problems with instance generators and textual descriptions aimed at capability-focused testing, such as specialisation and information exploitation.<n> BLADE provides an out-of-the-box' solution to systematically evaluate LLM-driven AAD approaches.
arXiv Detail & Related papers (2025-04-28T18:34:09Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language models.<n>Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.<n>We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks [2.1331883629523634]
The suite encompasses 14 designed planning and scheduling problems that progress from basic to highly complex.<n>Each problem can be scaled along three dimensions: the number of parallel planning threads, the complexity of inter-dependencies, and the frequency of unexpected disruptions.<n>The benchmark aims to be opened to public, and drive progress in developing more adaptable, robust, and scalable AI planning systems for Real-world applications.
arXiv Detail & Related papers (2025-02-26T05:24:22Z) - Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring.
Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations.
Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z) - A Modular Framework for Reinforcement Learning Optimal Execution [68.8204255655161]
We develop a modular framework for the application of Reinforcement Learning to the problem of Optimal Trade Execution.
The framework is designed with flexibility in mind, in order to ease the implementation of different simulation setups.
arXiv Detail & Related papers (2022-08-11T09:40:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.