Fugu-MT Paper Translation (Abstract): AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

Paper overview: AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

arxiv url: http://arxiv.org/abs/2605.08647v1
Date: Sat, 09 May 2026 03:35:09 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.791129
Title: AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
Title (reference translation): AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
Authors: Aritra Mazumder, Shubhashis Roy Dipta, Nusrat Jahan Lia, Tanzila Khan, Kainat Raisa Hossain, Nehaa Shri, Shubhrangshu Debsarkar, Humayra Tasnim, Gour Gupal Talukder Shawon, Debjoty Mitra, Sumaiya Ahmed Rani, Al Jami Islam Anik, Al Nafeu Khan,
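Abstract summary: AgentCollabBench is a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks. Four modern LLMs are evaluated: GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct. Communication topology emerges as a primary risk factor, explaining 7-40% of the variance in multi-hop information survival.
Reference score (attention score computed by this site): 0.0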
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-agent systems achieve state-of-the-art outcomes through peer collaboration. However, when an agent in the pipeline silently drops a constraint, the system's final output may look correct even though the reasoning chain was quietly corrupted, and existing outcome-based evaluations are blind to such multi-hop process failures. To make these vulnerabilities measurable before deployment, we introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct), we expose model-specific vulnerability profiles invisible to outcome-only evaluation; Qwen-3.5-35B-A3B, for example, leads on tracer durability and instruction stability, while GPT 4.1 mini leads on leakage containment and false-belief resistance. Beyond per-model differences, communication topology emerges as a primary risk factor that explains 7-40% of the variance in multi-hop information survival. The effect traces to a synthesis bottleneck specific to converging-DAG nodes: an agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck structurally absent from linear chains. AgentCollabBench demonstrates that suboptimal topology can silently erase the safeguards of highly capable models, arguing that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.
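Abstract (reference translation): Multi-agent systems reach state-of-the-art results through peer collaboration. When an agent in the pipeline silently drops a constraint, however, the final output can still look correct even though the reasoning chain has been corrupted, and existing outcome-based evaluations cannot see such multi-hop process failures. To make these vulnerabilities measurable before deployment, the authors introduce AgentCollabBench, a diagnostic benchmark of 900 human-validated tasks spanning software engineering, DevOps, and data engineering. Each task isolates one of four behavioral risks: instruction decay (does a constraint survive peer pressure?), false-belief contagion (does a falsehood spread through consensus?), context leakage (does information bleed between tasks?), and tracer durability (does marked data reach the final agent?). Evaluating four modern LLMs (GPT 4.1 mini, Gemini 2.5 Flash Lite, Qwen-3.5-35B-A3B, and Llama 3.1 8B Instruct) reveals model-specific vulnerability profiles that outcome-only evaluation does not show. Beyond per-model differences, communication topology emerges as a primary risk factor, explaining 7-40% of the variance in multi-hop information survival. An agent weighing competing parent inputs discards constraints carried by a minority branch, a bottleneck that is structurally absent from linear chains. AgentCollabBench argues that multi-agent reliability is fundamentally a structural problem and that scaling model intelligence alone is no substitute for architecture.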
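The synthesis bottleneck described in the abstract can be illustrated with a toy model. The sketch below is not the benchmark's method: the agents are plain functions rather than LLMs, and the strict-majority rule in `synthesize`, along with every name and constraint string, is a hypothetical assumption chosen only to show why a constraint carried by a minority branch can vanish at a converging node while surviving a linear chain.

```python
from __future__ import annotations

from collections import Counter


def relay(message: frozenset[str]) -> frozenset[str]:
    """Linear-chain agent (toy): forwards every constraint it received."""
    return message


def synthesize(parent_messages: list[frozenset[str]]) -> frozenset[str]:
    """Converging-DAG agent (toy, assumed rule): keeps a constraint only
    if a strict majority of its parents carried it."""
    counts = Counter(c for msg in parent_messages for c in msg)
    quorum = len(parent_messages) / 2
    return frozenset(c for c, n in counts.items() if n > quorum)


def run_linear_chain(source: frozenset[str], hops: int) -> frozenset[str]:
    """Pass the message through `hops` relay agents in sequence."""
    msg = source
    for _ in range(hops):
        msg = relay(msg)
    return msg


def run_converging_dag(branches: list[frozenset[str]]) -> frozenset[str]:
    """Merge several parent branches at a single synthesis node."""
    return synthesize(branches)


# Hypothetical constraint labels, used only for illustration.
tracer = "TRACER-42"
shared = "use-utc-timestamps"

chain_out = run_linear_chain(frozenset({shared, tracer}), hops=3)
dag_out = run_converging_dag([
    frozenset({shared, tracer}),  # minority branch carries the tracer
    frozenset({shared}),
    frozenset({shared}),
])

print("tracer survives linear chain:", tracer in chain_out)
print("tracer survives converging DAG:", tracer in dag_out)
```

In this toy setting the tracer reaches the end of the three-hop chain but is dropped at the merge node, because only one of three parents carried it. Real agents drop or keep constraints probabilistically, so the benchmark's measured effect is a matter of degree rather than the all-or-nothing outcome shown here.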

Related papers

Beyond the Black Box: Interpretability of Agentic AI Tool Use [0.0]
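This paper proposes a mechanistic interpretability toolkit built on sparse autoencoders and linear probes. The framework reads the model state before each action and infers both whether a tool is needed and how appropriate the next tool action is. The authors train probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to the GPT-OSS 20B and Gemma 3 27B models.
Paper reference translation (metadata) (2026-05-07T19:47:30Z)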
LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation [75.05397479715576]
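Large language models (LLMs) and agents have shown promising progress, but their true capabilities and failure modes remain unclear. The authors present the first systematic, contamination-aware study of LLM- and agent-based formal specification generation for C programs.
Paper reference translation (metadata) (2026-05-02T11:31:33Z)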
Beyond the Attention Stability Boundary: Agentic Self-Synthesizing Reasoning Protocols [6.357772907811544]
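SSRP (Self-Synthesizing Reasoning Protocols) is a metacognitive framework that separates architectural planning from procedural execution. The proposed experiments comprise three tiers: a shallow-current-based retrieval pilot, high-entropy SOPs, and a semantically hijacked 3-hop multi-element synthesis task. The results show that the non-stationary vanilla baseline for GPT 5.4 collapses to 0.1%, while SSRP achieves a 715X resilience margin.
Paper reference translation (metadata) (2026-04-27T14:13:30Z)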
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents [66.97968363332465]
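The authors introduce Claw-Eval, an end-to-end evaluation suite that addresses three gaps in agent benchmarks. Claw-Eval consists of 300 human-verified tasks across 9 categories in three groups. Every agent action is recorded through three independent evidence channels.
Paper reference translation (metadata) (2026-04-07T17:43:18Z)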
Dynamic analysis enhances issue resolution [53.50448142467294]
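DAIRA (Dynamic Analysis-enhanced Issue Resolution Agent) is an automated repair framework that incorporates dynamic analysis into the agent's reasoning cycle. Following a test-trace-driven methodology, DAIRA uses lightweight monitors to extract critical runtime data. With Gemini 3 Flash Preview, DAIRA establishes new state-of-the-art (SOTA) performance, achieving a 79.4% resolution rate on the SWE-bench Verified dataset.
Paper reference translation (metadata) (2026-03-23T14:48:54Z)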
Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications [51.56484100374058]
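The authors propose an automated self-testing framework that introduces a quality gate with evidence-based release decisions. They evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system.
Paper reference translation (metadata) (2026-03-13T20:44:15Z)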
To Throw a Stone with Six Birds: On Agents and Agenthood [0.0]
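Six Birds Theory (SBT) treats macroscopic objects as induced closures rather than primitives. Within SBT, the authors carry out a type-correctness evaluation. They operationalize this contract in a finite control system using four checkable components.
Paper reference translation (metadata) (2026-02-03T10:46:23Z)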
How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations [0.0]
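The authors investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, they analyze 900 execution traces from three representative models. They identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution.
Paper reference translation (metadata) (2025-12-08T12:27:15Z)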
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
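TraitBasis is a lightweight, model-agnostic method for systematically stress-testing AI agents. TraitBasis learns directions in activation space that correspond to steerable user traits. The authors observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models.
Paper reference translation (metadata) (2025-10-06T05:03:57Z)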

The related papers list is generated automatically from the titles and abstracts of papers on this site.

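This page shows information for the specified paper.
The operator of this site does not guarantee the quality of the site (including all information and translations) and accepts no responsibility for any consequences arising from its use.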