Fugu-MT Paper Translation (Abstract): MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

Paper summary: MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems

arxiv url: http://arxiv.org/abs/2602.03053v1
Date: Tue, 03 Feb 2026 03:30:36 GMT
Status: Translation complete
System update date: 2026-02-04 18:37:15.220895
Title: MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems
Title (reference translation): MAS-ProVe: Understanding Process Verification in Multi-Agent Systems
Authors: Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty
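Abstract summary: We present MAS-ProVe, a systematic study of process verification in multi-agent systems (MAS). The study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models). Process-level verification does not consistently improve performance and often exhibits high variance.
Reference score (attention score computed independently by this site): 59.20800753428596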
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multi-Agent Systems (MAS) built on Large Language Models (LLMs) often exhibit high variance in their reasoning trajectories. Process verification, which evaluates intermediate steps in trajectories, has shown promise in general reasoning settings, and has been suggested as a potential tool for guiding coordination of MAS; however, its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic empirical study of process verification for multi-agent systems (MAS). Our study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated across two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and conduct experiments over six diverse MAS frameworks on multiple reasoning benchmarks. We find that process-level verification does not consistently improve performance and frequently exhibits high variance, highlighting the difficulty of reliably evaluating partial multi-agent trajectories. Among the methods studied, LLM-as-a-Judge generally outperforms reward-based approaches, with trained judges surpassing general-purpose LLMs. We further observe a small performance gap between LLMs acting as judges and as single agents, and identify a context-length-performance trade-off in verification. Overall, our results suggest that effective and robust process verification for MAS remains an open challenge, requiring further advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.
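Abstract (reference translation): Multi-agent systems (MAS) built on Large Language Models (LLMs) often show high variance in their reasoning trajectories. Process verification, which evaluates the intermediate steps of a trajectory, looks promising in general reasoning settings and has been proposed as a potential tool for guiding coordination in MAS, but its actual effectiveness in MAS remains unclear. To fill this gap, we present MAS-ProVe, a systematic study of process verification in multi-agent systems. The study spans three verification paradigms (LLM-as-a-Judge, reward models, and process reward models), evaluated at two levels of verification granularity (agent-level and iteration-level). We further examine five representative verifiers and four context management strategies, and run experiments on six MAS frameworks across multiple reasoning benchmarks. Process-level verification does not consistently improve performance and frequently exhibits high variance, which highlights how difficult it is to reliably evaluate partial multi-agent trajectories. LLM-as-a-Judge generally outperforms reward-based approaches, and trained judges surpass general-purpose LLMs. We also observe a small performance gap between LLMs acting as judges and LLMs acting as single agents, and identify a trade-off between context length and performance in verification. Overall, the results suggest that effective and robust process verification for MAS remains an open challenge that requires advances beyond current paradigms. Code is available at https://github.com/Wang-ML-Lab/MAS-ProVe.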
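
The two verification granularities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how verifier scores are used, so the sketch assumes a simple "score several candidates, keep the best" scheme, and every name in it (`Verifier`, `pick_best`, `agent_level`, `iteration_level`, `toy_verifier`) is hypothetical. The verifier is a toy stand-in for an LLM judge or reward model.

```python
from typing import Callable, List, Sequence

# Hypothetical types (assumptions, not from the paper):
# a "step" is one agent's output; a candidate is a list of steps.
Step = str
Verifier = Callable[[str, Sequence[Step]], float]  # (context, steps) -> score


def pick_best(context: str, candidates: List[List[Step]], verifier: Verifier) -> List[Step]:
    """Return the candidate with the highest verifier score (first one wins ties)."""
    return max(candidates, key=lambda cand: verifier(context, cand))


def agent_level(context: str, per_agent_candidates: List[List[Step]], verifier: Verifier) -> List[Step]:
    """Verify after every agent call: pick each agent's output separately."""
    chosen: List[Step] = []
    for candidates in per_agent_candidates:  # one list of alternatives per agent
        # The verifier sees the task plus the steps already accepted.
        ctx = context + "\n" + "\n".join(chosen)
        best = pick_best(ctx, [[c] for c in candidates], verifier)
        chosen.extend(best)
    return chosen


def iteration_level(context: str, iteration_candidates: List[List[Step]], verifier: Verifier) -> List[Step]:
    """Verify once per round: pick among complete iterations of all agents."""
    return pick_best(context, iteration_candidates, verifier)


def toy_verifier(context: str, steps: Sequence[Step]) -> float:
    """Stand-in scorer: counts steps that mention 'check'. A real verifier
    would be an LLM judge, a reward model, or a process reward model."""
    return float(sum("check" in s for s in steps))


if __name__ == "__main__":
    print(agent_level("Q", [["guess", "check a"], ["check b", "skip"]], toy_verifier))
    print(iteration_level("Q", [["guess", "skip"], ["check a", "check b"]], toy_verifier))
```

In this sketch, agent-level verification calls the verifier once per agent with a growing context, while iteration-level verification calls it once per round on a longer candidate. That difference is one place where the trade-off between context length and performance mentioned in the abstract could arise.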

