Fugu-MT paper translation (abstract): SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Paper overview: SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

arxiv url: http://arxiv.org/abs/2603.24755v1
Date: Wed, 25 Mar 2026 19:26:44 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:47.962336
Title: SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
Title (reference translation): SlopCodeBench: A Benchmark of How Coding Agents Degrade Over Long, Iterative Tasks
Authors: Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi
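Abstract summary: We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints. We track two trajectory-level quality signals: verbosity (the fraction of redundant or duplicated code) and structural erosion. No agent solves any problem end-to-end across 11 models.
Reference score (independently computed attention score): 55.76734816061826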
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
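Abstract (reference translation): Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite yet become progressively harder to extend. Recent iterative benchmarks try to close this gap, but they constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints. We track two trajectory-level quality signals: verbosity (the fraction of redundant or duplicated code) and structural erosion (the share of complexity mass concentrated in high-complexity functions). No agent solves any problem end-to-end across 11 models. Erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but that this does not halt degradation. These results show that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline that iterative software development demands.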
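The abstract describes structural erosion as the share of complexity mass concentrated in high-complexity functions. The following is a minimal sketch of one way such a share could be computed for Python source, not the authors' implementation. The complexity proxy (1 plus a count of branching AST nodes), the threshold of 10, and the function names `function_complexities` and `structural_erosion` are all assumptions made for illustration.

```python
import ast

# Assumption: these node types stand in for decision points, giving a rough
# cyclomatic-complexity proxy. The paper's actual measure may differ.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def function_complexities(source: str) -> dict[str, int]:
    """Map each function name to 1 + the number of branch nodes inside it.

    Nested functions also contribute to their enclosing function, and
    functions sharing a name overwrite one another; both are simplifications.
    """
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            result[node.name] = 1 + branches
    return result


def structural_erosion(source: str, threshold: int = 10) -> float:
    """Share of total complexity held by functions above `threshold`.

    The threshold value is hypothetical, chosen only for this sketch.
    """
    scores = function_complexities(source)
    total = sum(scores.values())
    if total == 0:
        return 0.0  # no functions, so nothing to erode
    heavy = sum(c for c in scores.values() if c > threshold)
    return heavy / total
```

Under this sketch, a value near 0 means complexity is spread across small functions, while a value near 1 means most of it sits in a few large ones. Calling it on the code at each checkpoint would give a per-trajectory curve of the kind the abstract reports as rising.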

Related papers