MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
- URL: http://arxiv.org/abs/2510.19423v1
- Date: Wed, 22 Oct 2025 09:45:11 GMT
- Title: MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
- Authors: Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai,
- Abstract summary: MSC-Bench is a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents. It addresses gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score. It systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
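To make the set-based scoring concrete, here is a minimal sketch, assuming an 'equal function set' groups interchangeable tools that all satisfy the same required function; the function name, tool names, and data are illustrative, not taken from the MSC-Bench release.

```python
# Hypothetical sketch: F1 of a predicted tool chain against ground truth
# given as "equal function sets" (each required function is a set of
# interchangeable tools). Illustrative only, not the benchmark's code.

def f1_against_equal_sets(predicted_tools, equal_function_sets):
    """A predicted tool is a true positive if it satisfies some required
    function set; each function set may be satisfied at most once."""
    predicted = set(predicted_tools)
    matched_sets = set()
    tp = 0
    for tool in predicted:
        for i, eq_set in enumerate(equal_function_sets):
            if i not in matched_sets and tool in eq_set:
                matched_sets.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(equal_function_sets) if equal_function_sets else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: two required functions, each with interchangeable tools.
gt = [{"geocode_v1", "geocode_v2"}, {"weather_lookup"}]
print(f1_against_equal_sets(["geocode_v2", "weather_lookup"], gt))  # 1.0
print(f1_against_equal_sets(["geocode_v1", "translate"], gt))       # 0.5
```

Because either member of an equal function set counts as a hit, the metric stays objective under functional overlap without needing an LLM judge.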
Related papers
- DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation. We develop the DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
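As a rough illustration of the decoupling idea, here is a minimal sketch in which the evaluation driver only sees two small interfaces, so models and benchmarks can be swapped plug-and-play; all names are hypothetical and do not come from the DEP Toolkit.

```python
# Hypothetical sketch of decoupled, plug-and-play evaluation: the harness
# depends only on these two interfaces, not on any concrete model or bench.
from typing import Protocol, Iterable

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

class Benchmark(Protocol):
    def tasks(self) -> Iterable[tuple[str, str]]: ...   # (prompt, reference)
    def score(self, output: str, reference: str) -> float: ...

def evaluate(model: Model, bench: Benchmark) -> float:
    """Run every task and average the scores; resume or concurrency
    would wrap this loop without touching Model or Benchmark."""
    scores = [bench.score(model.generate(p), ref) for p, ref in bench.tasks()]
    return sum(scores) / len(scores) if scores else 0.0

class EchoModel:                      # toy stand-in for a real LLM backend
    def generate(self, prompt: str) -> str:
        return prompt.upper()

class ToyBench:                       # toy stand-in for a real benchmark
    def tasks(self):
        return [("hi", "HI"), ("ok", "OK")]
    def score(self, output, reference):
        return float(output == reference)

print(evaluate(EchoModel(), ToyBench()))  # 1.0
```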
arXiv Detail & Related papers (2026-03-01T16:10:16Z)
- Refer-Agent: A Collaborative Multi-Agent System with Reasoning and Reflection for Referring Video Object Segmentation [50.22481337087162]
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Refer-Agent is a collaborative multi-agent system with alternating reasoning-reflection mechanisms.
arXiv Detail & Related papers (2026-02-03T14:48:12Z)
- Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis [2.903627214446312]
We introduce an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions. We develop a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%.
arXiv Detail & Related papers (2026-02-03T05:37:56Z)
- MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability [37.727210168531364]
MAESTRO is an evaluation suite for the testing, reliability, and observability of LLM-based MAS. We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance.
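A toy sketch of the "structurally stable yet temporally variable" observation, with invented data and nothing from MAESTRO's API: the call graph is identical across runs while wall-clock latency varies widely.

```python
# Illustrative only: identical multi-agent call graphs across runs, with
# large latency spread, i.e. structural stability but temporal variance.
from statistics import mean, stdev

# Each run records its sequence of agent calls and end-to-end latency (s).
runs = [
    {"call_graph": ("planner", "coder", "critic"), "latency": 12.1},
    {"call_graph": ("planner", "coder", "critic"), "latency": 31.7},
    {"call_graph": ("planner", "coder", "critic"), "latency": 18.4},
]

structurally_stable = len({r["call_graph"] for r in runs}) == 1
latencies = [r["latency"] for r in runs]
print(structurally_stable)                # True: one unique call graph
print(mean(latencies), stdev(latencies))  # substantial run-to-run variance
```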
arXiv Detail & Related papers (2026-01-01T21:25:52Z)
- PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code [1.1164117387254457]
Large Language Model (LLM)-based code assistants have emerged as a powerful application of generative AI. A key requirement for these systems is their ability to accurately follow user instructions. We present PACIFIC, a novel framework designed to automatically generate benchmarks that rigorously assess sequential instruction-following and code dry-running capabilities.
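One way such a dry-running item can be checked automatically is sketched below with an invented snippet; this is not PACIFIC's generator, just the general pattern of asking the model to predict a program's output and verifying by actually executing it.

```python
# Hypothetical automatically checked "dry-run" item: the model must predict
# the snippet's stdout; the harness verifies against real execution.
import contextlib
import io

snippet = "x = [i * i for i in range(4)]\nprint(sum(x))"

def actual_output(code: str) -> str:
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})  # benchmark-authored, trusted snippet
    return buf.getvalue().strip()

def check_dry_run(model_prediction: str) -> bool:
    """True iff the model's predicted stdout matches real execution."""
    return model_prediction.strip() == actual_output(snippet)

print(check_dry_run("14"))  # True: 0 + 1 + 4 + 9 = 14
```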
arXiv Detail & Related papers (2025-12-11T14:49:56Z)
- On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset [16.921428284844684]
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems. We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
arXiv Detail & Related papers (2025-10-27T00:58:48Z)
- ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases [58.411135609139855]
"Shortcuts" to complete tasks pose significant risks for reliable assessment and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
arXiv Detail & Related papers (2025-10-23T06:58:32Z)
- RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems [31.4909149697414]
Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs). Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. We propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG.
arXiv Detail & Related papers (2025-10-15T04:13:00Z)
- MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments [6.12783571098263]
MEMTRACK is a benchmark designed to evaluate long-term memory and state tracking in multi-platform agent environments. Each benchmark instance provides a chronologically platform-interleaved timeline with noisy, conflicting, cross-referring information. Our benchmark tests memory capabilities such as acquisition, selection, and conflict resolution.
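As a minimal illustration of the conflict-resolution skill being probed (invented events, not benchmark data), an agent tracking a fact across platforms should let the newest statement supersede older ones.

```python
# Illustrative only: events about one fact arrive interleaved across
# platforms; conflicts are resolved by recency (newest statement wins).
from dataclasses import dataclass

@dataclass
class Event:
    t: int          # global timestamp across platforms
    platform: str   # e.g. "slack", "email", "jira"
    key: str
    value: str

timeline = [
    Event(1, "slack", "release_date", "May 3"),
    Event(2, "email", "owner", "dana"),
    Event(3, "jira",  "release_date", "May 10"),  # supersedes the Slack msg
]

def latest_state(events: list[Event]) -> dict[str, str]:
    """Resolve conflicts by recency: the newest event for a key wins."""
    state: dict[str, str] = {}
    for e in sorted(events, key=lambda e: e.t):
        state[e.key] = e.value
    return state

print(latest_state(timeline))  # {'release_date': 'May 10', 'owner': 'dana'}
```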
arXiv Detail & Related papers (2025-10-01T18:34:03Z)
- Diagnose, Localize, Align: A Full-Stack Framework for Reliable LLM Multi-Agent Systems under Instruction Conflicts [75.20929587906228]
Large Language Model (LLM)-powered multi-agent systems (MAS) have rapidly advanced collaborative reasoning, tool use, and role-specialized coordination in complex tasks. However, reliability-critical deployment remains hindered by a systemic failure mode: hierarchical compliance under instruction conflicts.
arXiv Detail & Related papers (2025-09-27T08:43:34Z)
- Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.83217247686402]
Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
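A hypothetical rendering of that three-layer decomposition in code; the class names mirror the paper's layers, while everything else (methods, stand-in retrieval) is invented for illustration.

```python
# Sketch of the three-layer decomposition; not the paper's implementation.

class LLMInferenceCore:
    def complete(self, prompt: str) -> str:
        return f"<completion for: {prompt!r}>"  # stand-in for a real model

class PromptOrchestrationLayer:
    def __init__(self, core: LLMInferenceCore):
        self.core = core
    def run(self, user_query: str, context: str) -> str:
        prompt = f"Context:\n{context}\n\nQuestion: {user_query}"
        return self.core.complete(prompt)

class SystemShellLayer:
    """Outermost layer: retrieval, tools, and multi-turn state live here."""
    def __init__(self, orchestrator: PromptOrchestrationLayer):
        self.orchestrator = orchestrator
        self.history: list[str] = []
    def handle(self, user_query: str) -> str:
        context = "\n".join(self.history[-3:])  # trivial stand-in retrieval
        answer = self.orchestrator.run(user_query, context)
        self.history.append(f"{user_query} -> {answer}")
        return answer

shell = SystemShellLayer(PromptOrchestrationLayer(LLMInferenceCore()))
print(shell.handle("What changed in v2?"))
```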
arXiv Detail & Related papers (2025-08-28T13:00:28Z)
- MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models [76.72220653705679]
We introduce MCPEval, an open-source framework that automates end-to-end task generation and deep evaluation of intelligent agents. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance.
arXiv Detail & Related papers (2025-07-17T05:46:27Z)
- Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration [4.66888457790348]
We present Gradientsys, a next-generation multi-agent scheduling framework. It coordinates diverse specialized AI agents using a typed Model-Context Protocol (MCP) and a ReAct-based dynamic planning loop. Experiments on the GAIA general-assistant benchmark show that Gradientsys achieves higher task success rates with reduced latency and lower API costs.
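The ReAct-style planning loop at the heart of such schedulers can be sketched as follows; the planner and tool registry here are toy stand-ins, not Gradientsys's implementation.

```python
# Minimal ReAct-style loop: plan an action, execute a tool, feed the
# observation back, repeat until the planner decides to finish.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"top hit for {q!r}",
    "calc":   lambda e: str(eval(e, {"__builtins__": {}}, {})),
}

def toy_planner(task: str, scratchpad: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: decide the next (tool, argument) or finish."""
    if not scratchpad:
        return ("calc", "6 * 7")
    return ("finish", scratchpad[-1])

def react_loop(task: str, max_steps: int = 5) -> str:
    scratchpad: list[str] = []
    for _ in range(max_steps):
        action, arg = toy_planner(task, scratchpad)  # Thought -> Action
        if action == "finish":
            return arg
        observation = TOOLS[action](arg)              # Act -> Observe
        scratchpad.append(observation)                # feed back in
    return "no answer within budget"

print(react_loop("what is 6 * 7?"))  # 42
```

In a typed MCP setting, the string-to-string tools above would instead carry declared input/output schemas the scheduler can validate before dispatch.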
arXiv Detail & Related papers (2025-07-09T03:40:56Z)
- The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason [1.6249398255272318]
We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks.
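The probe itself is simple to sketch with invented data: ask a model for the buggy file path given nothing but the issue text, and measure exact-match accuracy; high accuracy without repository access suggests memorization.

```python
# Illustrative sketch of the file-path memorization probe; data is invented.
from typing import Callable

def path_probe(model_guess: Callable[[str], str],
               issues: list[tuple[str, str]]) -> float:
    """Fraction of issues where the model names the gold buggy file path
    from the issue description alone (no repository access)."""
    hits = sum(model_guess(desc) == gold for desc, gold in issues)
    return hits / len(issues)

# Toy "model" that has memorized one benchmark instance.
memorized = {"Fix crash in URL parser": "src/urls.py"}
guess = lambda desc: memorized.get(desc, "unknown")
print(path_probe(guess, [("Fix crash in URL parser", "src/urls.py"),
                         ("Deadlock on shutdown", "core/runtime.py")]))  # 0.5
```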
arXiv Detail & Related papers (2025-06-14T00:25:26Z)
- Benchmarking LLMs' Swarm Intelligence [51.648605206159125]
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) remains largely unexplored. We introduce SwarmBench, a novel benchmark designed to systematically evaluate LLMs acting as decentralized agents on coordination tasks. We propose metrics for coordination effectiveness and analyze emergent group dynamics.
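A toy sketch of the decentralized setting (not SwarmBench's code): each agent perceives only a local window of a shared grid, with no global view or central controller.

```python
# Illustrative only: local perception in a shared grid world. 'A' marks
# agents, 'T' a target; the window size k bounds what an agent can see.

GRID = [
    "....T",
    ".A...",
    "...A.",
]

def local_view(grid: list[str], r: int, c: int, k: int = 1) -> list[str]:
    """The neighborhood an agent at (r, c) is allowed to perceive."""
    rows = range(max(0, r - k), min(len(grid), r + k + 1))
    return [grid[i][max(0, c - k): c + k + 1] for i in rows]

# The agent at (1, 1) cannot see the target 'T' at (0, 4):
for row in local_view(GRID, 1, 1):
    print(row)
```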
arXiv Detail & Related papers (2025-05-07T12:32:01Z)
- Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machines.
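To illustrate the kind of step-by-step trace an LLM must reproduce, here is a minimal Turing machine executor with an invented bit-flipping machine; the machine and code are illustrative, not TMBench's own tasks.

```python
# Minimal Turing machine executor; '_' is the blank symbol.
# delta[(state, symbol)] = (new_state, written_symbol, head_move)
delta = {
    ("flip", "0"): ("flip", "1", +1),  # invert bits, moving right
    ("flip", "1"): ("flip", "0", +1),
    ("flip", "_"): ("halt", "_", 0),   # blank reached: stop
}

def run(tape: str, state: str = "flip", max_steps: int = 100) -> str:
    cells, head = list(tape) + ["_"], 0
    for _ in range(max_steps):
        if state == "halt":
            break
        state, cells[head], move = delta[(state, cells[head])]
        head += move
    return "".join(cells).rstrip("_")

print(run("0110"))  # 1001
```

Because every step is determined by the transition table, an LLM's intermediate states can be checked exactly, which is what makes this kind of evaluation knowledge-agnostic.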
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
- SCE: Scalable Consistency Ensembles Make Blackbox Large Language Model Generation More Reliable [4.953092503184905]
Large language models (LLMs) have demonstrated remarkable performance, yet their diverse strengths and weaknesses prevent any single LLM from achieving dominance across all tasks. This work introduces Scalable Consistency Ensemble (SCE), an efficient framework for ensembling LLMs by prompting consistent outputs.
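Consistency ensembling can be sketched minimally as majority agreement over member outputs; this is a simplification of what the paper describes, with toy stand-ins for the black-box models.

```python
# Illustrative sketch: return the answer most ensemble members agree on.
from collections import Counter

def consistency_ensemble(answers: list[str]) -> str:
    """Most frequent answer across members, after light normalization."""
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]

members = ["Paris", "paris", "Lyon"]  # outputs from three black-box LLMs
print(consistency_ensemble(members))  # 'paris'
```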
arXiv Detail & Related papers (2025-03-13T20:54:28Z)
- AgentBench: Evaluating LLMs as Agents [99.12825098528212]
Using Large Language Models (LLMs) as agents has been widely acknowledged recently. We present AgentBench, a benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)