MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
- URL: http://arxiv.org/abs/2510.19423v1
- Date: Wed, 22 Oct 2025 09:45:11 GMT
- Title: MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
- Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai,
- Abstract summary: MSC-Bench is a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents. It addresses gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score. It systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
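To make the set-based scoring concrete, here is a minimal sketch, assuming an 'equal function set' groups interchangeable tools that all satisfy the same required function; the function name, tool names, and data are illustrative, not taken from the MSC-Bench release.

```python
# Hypothetical sketch: F1 of a predicted tool chain against ground truth
# given as "equal function sets" (each required function is a set of
# interchangeable tools). Illustrative only, not the benchmark's code.

def f1_against_equal_sets(predicted_tools, equal_function_sets):
    """A predicted tool is a true positive if it satisfies some required
    function set; each function set may be satisfied at most once."""
    predicted = set(predicted_tools)
    matched_sets = set()
    tp = 0
    for tool in predicted:
        for i, eq_set in enumerate(equal_function_sets):
            if i not in matched_sets and tool in eq_set:
                matched_sets.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(equal_function_sets) if equal_function_sets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two required functions, each with interchangeable tools.
gt = [{"geocode_v1", "geocode_v2"}, {"weather_lookup"}]
print(f1_against_equal_sets(["geocode_v2", "weather_lookup"], gt))  # 1.0
print(f1_against_equal_sets(["geocode_v1", "translate"], gt))       # 0.5
```

Because either member of an equal function set counts as a hit, the metric stays objective under functional overlap without needing an LLM judge.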
Related papers
- DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation. We develop the DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
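As a rough illustration of the decoupling idea, here is a minimal sketch in which the evaluation driver only sees two small interfaces, so models and benchmarks can be swapped plug-and-play; all names are hypothetical and do not come from the DEP Toolkit.

```python
# Hypothetical sketch of decoupled, plug-and-play evaluation: the harness
# depends only on these two interfaces, not on any concrete model or bench.
from typing import Protocol, Iterable

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

class Benchmark(Protocol):
    def tasks(self) -> Iterable[tuple[str, str]]: ...   # (prompt, reference)
    def score(self, output: str, reference: str) -> float: ...

def evaluate(model: Model, bench: Benchmark) -> float:
    """Run every task and average the scores; resume or concurrency
    would wrap this loop without touching Model or Benchmark."""
    scores = [bench.score(model.generate(p), ref) for p, ref in bench.tasks()]
    return sum(scores) / len(scores) if scores else 0.0

class EchoModel:                      # toy stand-in for a real LLM backend
    def generate(self, prompt: str) -> str:
        return prompt.upper()

class ToyBench:                       # toy stand-in for a real benchmark
    def tasks(self):
        return [("hi", "HI"), ("ok", "OK")]
    def score(self, output, reference):
        return float(output == reference)

print(evaluate(EchoModel(), ToyBench()))  # 1.0
```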
arXiv Detail & Related papers (2026-03-01T16:10:16Z)
- Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
- Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis [2.903627214446312]
We introduce an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions. We develop a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%.
arXiv Detail & Related papers (2026-02-03T05:37:56Z)
- MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability [37.727210168531364]
MAESTRO is an evaluation suite for the testing, reliability, and observability of LLM-based MAS. We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance.
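A toy sketch of the "structurally stable yet temporally variable" observation, with invented data and nothing from MAESTRO's API: the call graph is identical across runs while wall-clock latency varies widely.

```python
# Illustrative only: identical multi-agent call graphs across runs, with
# large latency spread, i.e. structural stability but temporal variance.
from statistics import mean, stdev

# Each run records its sequence of agent calls and end-to-end latency (s).
runs = [
    {"call_graph": ("planner", "coder", "critic"), "latency": 12.1},
    {"call_graph": ("planner", "coder", "critic"), "latency": 31.7},
    {"call_graph": ("planner", "coder", "critic"), "latency": 18.4},
]

structurally_stable = len({r["call_graph"] for r in runs}) == 1
latencies = [r["latency"] for r in runs]
print(structurally_stable)                # True: one unique call graph
print(mean(latencies), stdev(latencies))  # substantial run-to-run variance
```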
arXiv Detail & Related papers (2026-01-01T21:25:52Z)
- PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code [1.1164117387254457]
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI. A key requirement for these systems is their ability to accurately follow user instructions. We present PACIFIC, a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities.
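One way such a dry-running item can be checked automatically is sketched below with an invented snippet; this is not PACIFIC's generator, just the general pattern of asking the model to predict a program's output and verifying by actually executing it.

```python
# Hypothetical automatically checked "dry-run" item: the model must predict
# the snippet's stdout; the harness verifies against real execution.
import contextlib
import io

snippet = "x = [i * i for i in range(4)]\nprint(sum(x))"

def actual_output(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # benchmark-authored, trusted snippet
    return buf.getvalue().strip()

def check_dry_run(model_prediction: str) -> bool:
    """True iff the model's predicted stdout matches real execution."""
    return model_prediction.strip() == actual_output(snippet)

print(check_dry_run("14"))  # True: 0 + 1 + 4 + 9 = 14
```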
arXiv Detail & Related papers (2025-12-11T14:49:56Z)
- On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset [16.921428284844684]
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems. We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
arXiv Detail & Related papers (2025-10-27T00:58:48Z)
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z)
- RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems [31.4909149697414]
Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs). Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. We propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG.
arXiv Detail & Related papers (2025-10-15T04:13:00Z)
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information. Our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution.
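As a minimal illustration of the conflict-resolution skill being probed (invented events, not benchmark data), an agent tracking a fact across platforms should let the newest statement supersede older ones.

```python
# Illustrative only: events about one fact arrive interleaved across
# platforms; conflicts are resolved by recency (newest statement wins).
from dataclasses import dataclass

@dataclass
class Event:
    t: int          # global timestamp across platforms
    platform: str   # e.g. "slack", "email", "jira"
    key: str
    value: str

timeline = [
    Event(1, "slack", "release_date", "May 3"),
    Event(2, "email", "owner", "dana"),
    Event(3, "jira",  "release_date", "May 10"),  # supersedes the Slack msg
]

def latest_state(events: list[Event]) -> dict[str, str]:
    """Resolve conflicts by recency: the newest event for a key wins."""
    state: dict[str, str] = {}
    for e in sorted(events, key=lambda e: e.t):
        state[e.key] = e.value
    return state

print(latest_state(timeline))  # {'release_date': 'May 10', 'owner': 'dana'}
```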
arXiv Detail & Related papers (2025-10-01T18:34:03Z)
- Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts [75.20929587906228]
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts.
arXiv Detail & Related papers (2025-09-27T08:43:34Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
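A hypothetical rendering of that three-layer decomposition in code; the class names mirror the paper's layers, while everything else (methods, stand-in retrieval) is invented for illustration.

```python
# Sketch of the three-layer decomposition; not the paper's implementation.

class LLMInferenceCore:
    def complete(self, prompt: str) -> str:
        return f"<completion for: {prompt!r}>"  # stand-in for a real model

class PromptOrchestrationLayer:
    def __init__(self, core: LLMInferenceCore):
        self.core = core
    def run(self, user_query: str, context: str) -> str:
        prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
        return self.core.complete(prompt)

class SystemShellLayer:
    """Outermost layer: retrieval, tools, and multi-turn state live here."""
    def __init__(self, orchestrator: PromptOrchestrationLayer):
        self.orchestrator = orchestrator
        self.history: list[str] = []
    def handle(self, user_query: str) -> str:
        context = "\n".join(self.history[-3:])  # trivial stand-in retrieval
        answer = self.orchestrator.run(user_query, context)
        self.history.append(f"{user_query} -> {answer}")
        return answer

shell = SystemShellLayer(PromptOrchestrationLayer(LLMInferenceCore()))
print(shell.handle("What changed in v2?"))
```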
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
- MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models [76.72220653705679]
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z)
- Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration [4.66888457790348]
We present Gradientsys, a next-generation multi-agent scheduling framework. It coordinates diverse specialized AI agents using a typed Model-Context Protocol (MCP) and a ReAct-based dynamic planning loop. Experiments on the GAIA general-assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs.
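The ReAct-style planning loop at the heart of such schedulers can be sketched as follows; the planner and tool registry here are toy stand-ins, not Gradientsys's implementation.

```python
# Minimal ReAct-style loop: plan an action, execute a tool, feed the
# observation back, repeat until the planner decides to finish.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"top hit for {q!r}",
    "calc":   lambda e: str(eval(e, {"__builtins__": {}}, {})),
}

def toy_planner(task: str, scratchpad: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: decide the next (tool, argument) or finish."""
    if not scratchpad:
        return ("calc", "6 * 7")
    return ("finish", scratchpad[-1])

def react_loop(task: str, max_steps: int = 5) -> str:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        action, arg = toy_planner(task, scratchpad)  # Thought -> Action
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)              # Act -> Observe
        scratchpad.append(observation)                # feed back in
    return "no answer within budget"

print(react_loop("what is 6 * 7?"))  # 42
```

In a typed MCP setting, the string-to-string tools above would instead carry declared input/output schemas the scheduler can validate before dispatch.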
arXiv Detail & Related papers (2025-07-09T03:40:56Z)
- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks.
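The probe itself is simple to sketch with invented data: ask a model for the buggy file path given nothing but the issue text, and measure exact-match accuracy; high accuracy without repository access suggests memorization.

```python
# Illustrative sketch of the file-path memorization probe; data is invented.
from typing import Callable

def path_probe(model_guess: Callable[[str], str],
               issues: list[tuple[str, str]]) -> float:
    """Fraction of issues where the model names the gold buggy file path
    from the issue description alone (no repository access)."""
    hits = sum(model_guess(desc) == gold for desc, gold in issues)
    return hits / len(issues)

# Toy "model" that has memorized one benchmark instance.
memorized = {"Fix crash in URL parser": "src/urls.py"}
guess = lambda desc: memorized.get(desc, "unknown")
print(path_probe(guess, [("Fix crash in URL parser", "src/urls.py"),
                         ("Deadlock on shutdown", "core/runtime.py")]))  # 0.5
```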
arXiv Detail & Related papers (2025-06-14T00:25:26Z)
- Benchmarking LLMs' Swarm Intelligence [51.648605206159125]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate LLMs acting as decentralized agents on coordination tasks. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
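A toy sketch of the decentralized setting (not SwarmBench's code): each agent perceives only a local window of a shared grid, with no global view or central controller.

```python
# Illustrative only: local perception in a shared grid world. 'A' marks
# agents, 'T' a target; the window size k bounds what an agent can see.

GRID = [
    "....T",
    ".A...",
    "...A.",
]

def local_view(grid: list[str], r: int, c: int, k: int = 1) -> list[str]:
    """The neighborhood an agent at (r, c) is allowed to perceive."""
    rows = range(max(0, r - k), min(len(grid), r + k + 1))
    return [grid[i][max(0, c - k): c + k + 1] for i in rows]

# The agent at (1, 1) cannot see the target 'T' at (0, 4):
for row in local_view(GRID, 1, 1):
    print(row)
```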
arXiv Detail & Related papers (2025-05-07T12:32:01Z)
- Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
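To illustrate the kind of step-by-step trace an LLM must reproduce, here is a minimal Turing machine executor with an invented bit-flipping machine; the machine and code are illustrative, not TMBench's own tasks.

```python
# Minimal Turing machine executor; '_' is the blank symbol.
# delta[(state, symbol)] = (new_state, written_symbol, head_move)
delta = {
    ("flip", "0"): ("flip", "1", +1),  # invert bits, moving right
    ("flip", "1"): ("flip", "0", +1),
    ("flip", "_"): ("halt", "_", 0),   # blank reached: stop
}

def run(tape: str, state: str = "flip", max_steps: int = 100) -> str:
    cells, head = list(tape) + ["_"], 0
    for _ in range(max_steps):
        if state == "halt":
            break
        state, cells[head], move = delta[(state, cells[head])]
        head += move
    return "".join(cells).rstrip("_")

print(run("0110"))  # 1001
```

Because every step is determined by the transition table, an LLM's intermediate states can be checked exactly, which is what makes this kind of evaluation knowledge-agnostic.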
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
- SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable [4.953092503184905]
Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs.
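Consistency ensembling can be sketched minimally as majority agreement over member outputs; this is a simplification of what the paper describes, with toy stand-ins for the black-box models.

```python
# Illustrative sketch: return the answer most ensemble members agree on.
from collections import Counter

def consistency_ensemble(answers: list[str]) -> str:
    """Most frequent answer across members, after light normalization."""
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]

members = ["Paris", "paris", "Lyon"]  # outputs from three black-box LLMs
print(consistency_ensemble(members))  # 'paris'
```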
arXiv Detail & Related papers (2025-03-13T20:54:28Z)
- AgentBench: Evaluating LLMs as Agents [99.12825098528212]
Using Large Language Models (LLMs) as agents has been widely acknowledged recently. We present AgentBench, a benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)