MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability
- URL: http://arxiv.org/abs/2601.00481v1
- Date: Thu, 01 Jan 2026 21:25:52 GMT
- Title: MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability
- Authors: Tie Ma, Yixi Chen, Vaastav Anand, Alessandro Cornacchia, Amândio R. Faustino, Guanheng Liu, Shan Zhang, Hongbin Luo, Suhaib A. Fahmy, Zafar A. Qazi, Marco Canini
- Abstract summary: MAESTRO is an evaluation suite for the testing, reliability, and observability of LLM-based MAS. We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance.
- Score: 37.727210168531364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MAESTRO, an evaluation suite for the testing, reliability, and observability of LLM-based multi-agent systems (MAS). MAESTRO standardizes MAS configuration and execution through a unified interface, supports integrating both native and third-party MAS via a repository of examples and lightweight adapters, and exports framework-agnostic execution traces together with system-level signals (e.g., latency, cost, and failures). We instantiate MAESTRO with 12 representative MAS spanning popular agentic frameworks and interaction patterns, and conduct controlled experiments across repeated runs, backend models, and tool configurations. Our case studies show that MAS executions can be structurally stable yet temporally variable, leading to substantial run-to-run variance in performance and reliability. We further find that MAS architecture is the dominant driver of resource profiles, reproducibility, and the cost-latency-accuracy trade-off, often outweighing changes in backend models or tool settings. Overall, MAESTRO enables systematic evaluation and provides empirical guidance for designing and optimizing agentic systems.
Related papers
- Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition [53.50448142467294]
RAIM is a multi-design and architecture-aware framework for repository-level feature addition. It shifts away from linear patching by generating multiple diverse implementation designs. Experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes new state-of-the-art performance.
arXiv Detail & Related papers (2026-03-02T12:50:40Z)
- Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems [0.0]
We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment.
arXiv Detail & Related papers (2026-02-23T21:37:06Z)
- MAS-on-the-Fly: Dynamic Adaptation of LLM-based Multi-Agent Systems at Test Time [32.22206915939924]
We introduce MASFly, a novel multi-agent framework enabling dynamic adaptation at test time. For adaptive execution, MASFly incorporates an experience-guided supervision mechanism, where a dedicated Watcher agent monitors system behaviors. Experiments demonstrate that MASFly achieves state-of-the-art performance, most notably a 61.7% success rate on the TravelPlanner benchmark.
arXiv Detail & Related papers (2026-02-14T08:38:13Z)
- Understanding Multi-Agent LLM Frameworks: A Unified Benchmark and Experimental Analysis [2.903627214446312]
We introduce an architectural taxonomy for systematically comparing multi-agent LLM frameworks along fundamental dimensions. We develop a unified evaluation suite that integrates existing benchmarks under a standardized execution pipeline. Our results show that framework-level design choices alone can increase latency by over 100x, reduce planning accuracy by up to 30%, and lower coordination success from above 90% to below 30%.
arXiv Detail & Related papers (2026-02-03T05:37:56Z)
- ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning [103.7657839292775]
ARM-Thinker is an agentic multimodal reward model that autonomously invokes external tools to ground judgments in verifiable evidence. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. Our results demonstrate that agentic capabilities significantly enhance both the accuracy and interpretability of reward models.
arXiv Detail & Related papers (2025-12-04T18:59:52Z)
- On Generalization in Agentic Tool Calling: CoreThink Agentic Reasoner and MAVEN Dataset [16.921428284844684]
Generalization across agentic tool-calling environments remains a key unsolved challenge in developing reliable reasoning systems. We present a framework that augments large language models with a lightweight symbolic reasoning layer for structured decomposition and adaptive tool orchestration.
arXiv Detail & Related papers (2025-10-27T00:58:48Z)
- MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration [0.0]
MSC-Bench is a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents. It addresses gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score. It systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests.
arXiv Detail & Related papers (2025-10-22T09:45:11Z)
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
- Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time [35.31371938688921]
Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. We propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space.
arXiv Detail & Related papers (2025-09-26T16:49:10Z)
- AgentCompass: Towards Reliable Evaluation of Agentic Workflows in Production [4.031479494871582]
We present AgentCompass, the first evaluation framework designed specifically for post-deployment monitoring and reasoning of agentic pipelines. AgentCompass achieves state-of-the-art results on key metrics, while uncovering critical issues missed in human annotations.
arXiv Detail & Related papers (2025-09-18T05:59:04Z)
- HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search [4.807104001943257]
Hierarchical Evolutionary Agent Simulation (HEAS) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation. HEAS represents models as hierarchies of lightweight processes ("streams") scheduled in deterministic layers that read and write a shared context. A compact API and CLI (simulate, optimize, evaluate) expose single- and multi-objective evolution.
arXiv Detail & Related papers (2025-08-21T13:35:46Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [76.42361936804313]
We introduce MAS-ZERO, the first self-evolved, inference-time framework for automatic MAS design. MAS-ZERO employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance.
arXiv Detail & Related papers (2025-05-21T00:56:09Z)
- The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.