Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models
- URL: http://arxiv.org/abs/2508.12566v1
- Date: Mon, 18 Aug 2025 02:06:05 GMT
- Title: Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models
- Authors: Wei Song, Haonan Zhong, Ziqi Ding, Jingling Xue, Yuekang Li
- Score: 9.49963945880421
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Model Context Protocol (MCP) enables large language models (LLMs) to access external resources on demand. While commonly assumed to enhance performance, how LLMs actually leverage this capability remains poorly understood. We introduce MCPGAUGE, the first comprehensive evaluation framework for probing LLM-MCP interactions along four key dimensions: proactivity (self-initiated tool use), compliance (adherence to tool-use instructions), effectiveness (task performance post-integration), and overhead (computational cost incurred). MCPGAUGE comprises a 160-prompt suite and 25 datasets spanning knowledge comprehension, general reasoning, and code generation. Our large-scale evaluation, spanning six commercial LLMs, 30 MCP tool suites, and both one- and two-turn interaction settings, comprises around 20,000 API calls and over USD 6,000 in computational cost. This comprehensive study reveals four key findings that challenge prevailing assumptions about the effectiveness of MCP integration. These insights highlight critical limitations in current AI-tool integration and position MCPGAUGE as a principled benchmark for advancing controllable, tool-augmented LLMs.
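The abstract names the four dimensions but not how they are scored. Purely as a hedged illustration (the rate and ratio definitions below are assumptions, not the paper's metrics), a per-dimension score could be computed from interaction logs along these lines:

```python
# Hypothetical scoring of MCPGAUGE's four dimensions over interaction logs.
# NOTE: these rate/ratio definitions are illustrative, not the paper's.
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    tool_available: bool       # an MCP tool suite was exposed to the model
    tool_use_instructed: bool  # the prompt explicitly asked for tool use
    used_tool: bool            # the model actually invoked an MCP tool
    correct: bool              # the final answer was judged correct
    tokens: int                # total tokens consumed (cost proxy)

def _avg(xs: List[float]) -> float:
    return sum(xs) / max(len(xs), 1)

def proactivity(logs: List[Interaction]) -> float:
    """Share of un-instructed, tool-available runs where a tool was used."""
    runs = [r for r in logs if r.tool_available and not r.tool_use_instructed]
    return _avg([float(r.used_tool) for r in runs])

def compliance(logs: List[Interaction]) -> float:
    """Share of explicitly instructed runs where the model obeyed."""
    runs = [r for r in logs if r.tool_use_instructed]
    return _avg([float(r.used_tool) for r in runs])

def effectiveness(logs: List[Interaction]) -> float:
    """Accuracy of tool-using runs minus accuracy of tool-free runs."""
    return (_avg([float(r.correct) for r in logs if r.used_tool])
            - _avg([float(r.correct) for r in logs if not r.used_tool]))

def overhead(logs: List[Interaction]) -> float:
    """Mean token cost of tool-using runs relative to tool-free runs."""
    return (_avg([r.tokens for r in logs if r.used_tool])
            / max(_avg([r.tokens for r in logs if not r.used_tool]), 1.0))
```

Under these toy definitions, proactivity and compliance are simple rates, while effectiveness and overhead compare tool-using runs against tool-free baselines.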
Related papers
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, yet the boundaries of their capabilities in automated formulation and problem solving remain poorly understood. We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z)
- RelayLLM: Efficient Reasoning via Collaborative Decoding [23.351598429979024]
RelayLLM is a novel framework for efficient reasoning via token-level collaborative decoding. We show that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models.
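For intuition, here is a minimal sketch of token-level collaborative decoding. The summary does not specify RelayLLM's routing criterion; the max-probability threshold and the `small` and `large` callables below are placeholder assumptions:

```python
# Illustrative token-level collaborative decoding in the spirit of RelayLLM.
# NOTE: the hand-off rule (max-probability threshold) is an assumption,
# not the paper's actual routing criterion.
from typing import Callable, Dict, List

Dist = Dict[str, float]  # next-token distribution: token -> probability

def relay_decode(small: Callable[[List[str]], Dist],
                 large: Callable[[List[str]], Dist],
                 prompt: List[str],
                 max_tokens: int = 32,
                 threshold: float = 0.7,
                 eos: str = "<eos>") -> List[str]:
    """Let the small model decode; relay low-confidence steps to the large one."""
    out = list(prompt)
    for _ in range(max_tokens):
        token, p = max(small(out).items(), key=lambda kv: kv[1])
        if p < threshold:                       # small model is unsure: defer
            token, _ = max(large(out).items(), key=lambda kv: kv[1])
        if token == eos:
            break
        out.append(token)
    return out[len(prompt):]
```

Each deferred position costs one large-model call, so the threshold directly trades accuracy against compute.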
arXiv Detail & Related papers (2026-01-08T17:56:16Z)
- Network and Systems Performance Characterization of MCP-Enabled LLM Agents [2.952262068394116]
Model Context Protocol (MCP) has recently gained increased attention within the AI community for providing a standardized way for large language models (LLMs) to interact with external tools and services. This paper presents a measurement-based analysis of MCP-enabled interactions with LLMs, revealing trade-offs between capability, performance, and cost.
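Such measurements can be approximated with a simple timing harness. In the sketch below, `send` is a hypothetical user-supplied transport that delivers a JSON-RPC request to an MCP server; the `tools/call` method and `params` shape follow the MCP specification, but everything else is illustrative:

```python
# Minimal latency harness for MCP-style JSON-RPC tool calls.
# ASSUMPTION: `send` is a user-supplied transport (HTTP, stdio, ...);
# this sketch only measures end-to-end wall-clock time around it.
import json
import statistics
import time
from typing import Callable

def time_tool_call(send: Callable[[str], str], tool: str,
                   arguments: dict, repeats: int = 10) -> dict:
    request = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        send(request)                  # network + server + tool latency
        samples.append(time.perf_counter() - start)
    return {"mean_s": statistics.mean(samples),
            "p95_s": sorted(samples)[int(0.95 * (repeats - 1))]}
```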
arXiv Detail & Related papers (2025-10-20T05:13:47Z)
- MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We also generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
- MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use [27.978435634514128]
MCPMark is a benchmark designed to evaluate MCP use in a more realistic and comprehensive manner. It consists of 127 high-quality tasks collaboratively created by domain experts and AI agents. On average, LLMs require 16.2 execution turns and 17.4 tool calls per task, far more than in previous MCP benchmarks.
arXiv Detail & Related papers (2025-09-28T17:53:27Z)
- MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs on realistic and challenging tasks through interaction with real-world MCP servers. The benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
- Integrating External Tools with Large Language Models to Improve Accuracy [0.0]
It is well known that, without relevant contextual information, large language models (LLMs) can produce poor-quality responses or hallucinate. Several initiatives have proposed integrating LLMs with external tools that provide up-to-date data to improve accuracy. In this paper, we propose a framework that integrates external tools to enhance the capabilities of LLMs when answering queries in educational settings.
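A bare-bones version of such an integration loop might look like the sketch below, where the model either answers directly or emits a tool request whose result is fed back as context. The prompt convention and all names here are invented for illustration, not the paper's framework:

```python
# Hypothetical tool-integration loop: the LLM may request a tool by name,
# and the tool's output is appended to the context for the next round.
from typing import Callable, Dict

def answer_with_tools(llm: Callable[[str], str],
                      tools: Dict[str, Callable[[str], str]],
                      query: str, max_rounds: int = 3) -> str:
    context = ""
    for _ in range(max_rounds):
        reply = llm(f"{context}Question: {query}\n"
                    "If you need a tool, answer exactly: TOOL <name> <input>")
        parts = reply.split(" ", 2)
        if not reply.startswith("TOOL ") or len(parts) < 3:
            return reply                          # direct (or malformed) answer
        _, name, tool_input = parts
        result = tools.get(name, lambda _: "unknown tool")(tool_input)
        context += f"Tool {name} returned: {result}\n"  # ground the next round
    return llm(f"{context}Question: {query}\nAnswer using the context above.")
```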
arXiv Detail & Related papers (2025-07-09T04:09:59Z)
- MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models [11.809732662992982]
This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate Large Language Model (LLM) performance within the Model Context Protocol (MCP) framework. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains.
arXiv Detail & Related papers (2025-05-22T14:02:37Z)
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents [63.43699771428243]
EmbodiedBench is an extensive benchmark designed to evaluate vision-driven embodied agents. We evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.
arXiv Detail & Related papers (2025-02-13T18:11:34Z)
- Ontology Matching with Large Language Models and Prioritized Depth-First Search [0.2454454561635539]
We introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%.
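For intuition, the sketch below compresses the PDFS strategy into a best-first priority queue over candidate matches: high-similarity pairs are accepted outright, and the LLM is consulted only for borderline scores. The helpers `retrieve` and `ask_llm` and both thresholds are assumptions, not MILA's actual interface:

```python
# Best-first simplification of a MILA-style prioritized search.
# ASSUMPTIONS: `retrieve` returns (target, similarity) candidates for a
# source concept; `ask_llm` adjudicates one borderline pair.
import heapq
from typing import Callable, Dict, List, Tuple

def match_ontology(sources: List[str],
                   retrieve: Callable[[str], List[Tuple[str, float]]],
                   ask_llm: Callable[[str, str], bool],
                   accept: float = 0.9, reject: float = 0.5) -> Dict[str, str]:
    matches: Dict[str, str] = {}
    # Max-heap on similarity: most promising pairs are examined first.
    heap = [(-score, src, tgt)
            for src in sources for tgt, score in retrieve(src)]
    heapq.heapify(heap)
    while heap:
        neg_score, src, tgt = heapq.heappop(heap)
        score = -neg_score
        if src in matches or score < reject:
            continue                              # settled or hopeless pair
        if score >= accept or ask_llm(src, tgt):  # LLM only when borderline
            matches[src] = tgt
    return matches
```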
arXiv Detail & Related papers (2025-01-20T12:29:09Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, spanning 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback [78.60644407028022]
We introduce MINT, a benchmark that evaluates large language models' ability to solve tasks with multi-turn interactions.
LLMs generally benefit from tools and language feedback, with performance gains of 1-8% for each turn of tool use.
Among the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities.
arXiv Detail & Related papers (2023-09-19T15:25:42Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present LAMM, a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
The dataset and benchmark cover a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)