Fugu-MT Paper Translation (Abstract): ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

Paper abstract: ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

arxiv url: http://arxiv.org/abs/2605.14133v2
Date: Mon, 18 May 2026 05:36:27 GMT
Status: Translation complete
System update date: 2026-05-19 17:57:45.988108
Title: ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
Authors: Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao
Abstract summary: ClawForge is a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework is instantiated as ClawForge-Bench (17 scenarios, 6 ability categories).
Reference score (independently computed attention score): 59.626170560327274
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.
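
The abstract's contrast between end-state validation and exact trajectory matching can be illustrated with a minimal sketch. This is not the ClawForge implementation: the `TaskSpec` class, its field names, the `normalize` rule, and the `validate` function are all hypothetical and chosen only for illustration. The sketch assumes state can be modeled as a mapping from file paths to contents and side effects as a set of labels.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical task specification; field names are illustrative only."""
    initial_state: dict[str, str]   # path -> contents before the agent acts
    expected_state: dict[str, str]  # path -> contents required afterwards
    expected_side_effects: set[str] = field(default_factory=set)


def normalize(state: dict[str, str]) -> dict[str, str]:
    """Strip trailing whitespace and blank lines so cosmetic differences are ignored."""
    out = {}
    for path, text in state.items():
        lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
        out[path] = "\n".join(lines)
    return out


def validate(spec: TaskSpec, final_state: dict[str, str], side_effects: set[str]) -> bool:
    """Pass when the normalized end state matches and every required side effect
    was observed. The sequence of commands the agent ran is never inspected."""
    return (
        normalize(final_state) == normalize(spec.expected_state)
        and spec.expected_side_effects <= side_effects
    )


# A task that starts from a stale artifact rather than a clean state.
spec = TaskSpec(
    initial_state={"config.ini": "port=80\n"},
    expected_state={"config.ini": "port=8080\n"},
    expected_side_effects={"service_restarted"},
)
```

Under this sketch, any command sequence that leaves `config.ini` with the expected contents and triggers the required side effect passes, while an agent that leaves the stale value in place fails regardless of how plausible its steps looked.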
