Related papers: MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

URL: http://arxiv.org/abs/2602.03053v1
Date: Tue, 03 Feb 2026 03:30:36 GMT
Title: MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems
Authors: Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty,
Abstract summary: We present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS)<n>Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models)<n>We find that process-level verification does not consistently improve performance and frequently exhibits high variance.
Score: 59.20800753428596
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.

Related papers

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains [79.14584837105808]
We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures.<n>Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy.<n>By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning.
arXiv Detail & Related papers (2026-03-01T02:25:57Z)
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress [71.02263260394261]
Large language models (LLMs) still encounter challenges in multi-turn decision-making tasks.<n>We build process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process.<n>AgentPRM captures both the interdependence between sequential decisions and their contribution to the final goal.
arXiv Detail & Related papers (2025-11-11T14:57:54Z)
Prompt Optimization via Retrieved Reasoning Assets and Multi-Agent Analysis [5.935239028627343]
We introduce MA-SAPO, a Multi-Agent framework for Score-Aware Prompt Optimization.<n>Compared to prior methods, MA-SAPO explicitly couples evaluation outcomes with structured reasoning to guide systematic edits.<n>By turning evaluation signals into interpretable reasoning chains, MA-SAPO produces prompt refinements that are more transparent, auditable, and controllable.
arXiv Detail & Related papers (2025-10-18T20:21:09Z)
Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task [22.16139464288789]
Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in abstract visual Reasoning (AVR)<n>Existing benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of reasoning process.<n>We introduce MultiStAR, a Multi-Stage benchmark, designed to assess reasoning across varying levels of complexity.
arXiv Detail & Related papers (2025-05-28T00:34:45Z)
Why Do Multi-Agent LLM Systems Fail? [87.90075668488434]
We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks.<n>We build the first Multi-Agent System Failure taxonomy (MAST)<n>We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent)
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs)<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
Watch Every Step! LLM Agent Learning via Iterative Step-Level Process Refinement [50.481380478458945]
Iterative step-level Process Refinement (IPR) framework provides detailed step-by-step guidance to enhance agent training. Our experiments on three complex agent tasks demonstrate that our framework outperforms a variety of strong baselines.
arXiv Detail & Related papers (2024-06-17T03:29:13Z)
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [74.16170899755281]
We introduce AgentBoard, a pioneering comprehensive benchmark and accompanied open-source evaluation framework tailored to analytical evaluation of LLM agents.<n>AgentBoard offers a fine-grained progress rate metric that captures incremental advancements as well as a comprehensive evaluation toolkit.<n>This not only sheds light on the capabilities and limitations of LLM agents but also propels the interpretability of their performance to the forefront.
arXiv Detail & Related papers (2024-01-24T01:51:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.