Fugu-MT Paper Translation (Abstract): RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

Paper overview: RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

arxiv url: http://arxiv.org/abs/2606.22678v1
Date: Sun, 21 Jun 2026 21:41:34 GMT
Status: Translation complete
Last updated in system: 2026-06-25 07:39:06.10201
Title: RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents
Title (reference translation): RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents
Authors: Meher Bhaskar Madiraju, Meher Sai Preetam Madiraju
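Abstract summary: RigorBench is the first benchmark to measure process discipline in AI coding agents. It evaluates harnesses across five pillars: Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, and Atomic Transition Integrity. The results show that structured process discipline improves process quality scores by an average of 41% and raises downstream outcome correctness by 17%.
Reference score (independently computed attention score): 0.0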
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agentic coding harnesses - such as Agent-Skills, Superpowers, and Agent-Rigor - are increasingly deployed to augment underlying LLMs for real-world software engineering tasks. Existing benchmarks evaluate these agents almost exclusively on outcome correctness: whether generated code passes tests or resolves issues. We argue that this outcome-only lens is insufficient: an agent that arrives at a correct solution through reckless trial-and-error, without planning, verification, or graceful recovery, is fundamentally less reliable than one that follows sound engineering discipline. We introduce RigorBench, the first benchmark designed to measure process discipline in AI coding agents. RigorBench evaluates these harnesses across five pillars: Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, and Atomic Transition Integrity. A composite RigorScore aggregates these dimensions into a single metric via a weighted sum. We curate a suite of 30 tasks spanning five categories - Plan-Then-Build, Verify-Or-Die, Doom Loop Gauntlet, Know When to Fold, and Don't Break the Build - and evaluate leading harnesses in a controlled with/without experimental design against baseline coding assistants. Our results show that structured process discipline not only improves process quality scores by an average of 41% but also raises downstream outcome correctness by 17%, providing the first quantitative evidence that how agents code matters as much as what they produce. We release the full benchmark, scoring rubrics, and trajectory analysis tools as open-source artifacts.
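Abstract (reference translation): Agentic coding harnesses such as Agent-Skills, Superpowers, and Agent-Rigor are increasingly deployed to augment the underlying LLMs for real-world software engineering tasks. Existing benchmarks evaluate almost exclusively outcome correctness, that is, whether the generated code passes tests or resolves issues. We argue that this outcome-only lens is insufficient: an agent that reaches a correct solution through reckless trial-and-error, without planning, verification, or graceful recovery, is fundamentally less reliable than one that follows sound engineering discipline. We introduce RigorBench, the first benchmark designed to measure process discipline in AI coding agents. RigorBench evaluates these harnesses across five pillars: Planning Fidelity, Verification Coverage, Recovery Efficiency, Abstention Quality, and Atomic Transition Integrity. A composite RigorScore aggregates these dimensions into a single metric via a weighted sum. We curate a suite of 30 tasks spanning five categories: Plan-Then-Build, Verify-Or-Die, Doom Loop Gauntlet, Know When to Fold, and Don't Break the Build. The results show that structured process discipline not only improves process quality scores by an average of 41% but also raises downstream outcome correctness by 17%. We release the full benchmark, scoring rubrics, and trajectory analysis tools as open-source artifacts.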
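
The abstract states that the composite RigorScore combines the five pillar scores through a weighted sum, but it does not give the weights or the score scale. The following is only a minimal sketch of that kind of aggregation: the function name `rigor_score`, the pillar keys, the 0-to-1 scale, the equal weights, and the example values are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Mapping

# Pillar names follow the abstract; the snake_case keys are an assumption.
PILLARS = (
    "planning_fidelity",
    "verification_coverage",
    "recovery_efficiency",
    "abstention_quality",
    "atomic_transition_integrity",
)


def rigor_score(pillar_scores: Mapping[str, float],
                weights: Mapping[str, float]) -> float:
    """Return the weighted sum of the five pillar scores.

    Both the scores and the weights are hypothetical inputs; the
    abstract does not specify the actual values or their scale.
    """
    missing = [p for p in PILLARS if p not in pillar_scores or p not in weights]
    if missing:
        raise ValueError(f"missing pillar(s): {missing}")
    return sum(weights[p] * pillar_scores[p] for p in PILLARS)


# Hypothetical equal weighting that sums to 1; the real weights are unknown.
EQUAL_WEIGHTS = {p: 1.0 / len(PILLARS) for p in PILLARS}

# Made-up per-pillar scores on an assumed 0-to-1 scale.
example_scores = {
    "planning_fidelity": 0.8,
    "verification_coverage": 0.6,
    "recovery_efficiency": 0.5,
    "abstention_quality": 0.9,
    "atomic_transition_integrity": 0.7,
}

example = rigor_score(example_scores, EQUAL_WEIGHTS)
```

With weights that sum to 1, the composite stays on the same scale as the individual pillar scores, which keeps harnesses comparable on a single number; a different weighting would only change the `weights` argument.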


Related papers