Fugu-MT Paper Translation (Abstract): SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

Paper summary: SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

arxiv url: http://arxiv.org/abs/2605.08366v1
Date: Fri, 08 May 2026 18:21:44 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.590109
Title: SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
Title (reference translation): SWE Atlas: Benchmarking coding agents beyond issue resolution
Authors: Mohit Raghavendra, Soham Dan, Miguel Romero Calvo, Yannis Yiming He, Johannes Baptist Mols, Gautam Anand, Cole McCollum, Edgar Arakelyan, Vijay Bharadwaj, Andrew Park, Jeff Da, MohammadHossein Rezaei, Bing Liu, Brad Kenstler, Yunzhong He
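Abstract summary: SWE Atlas is a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). It targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.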
Reference score (independently computed attention score): 16.25554650122462
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.
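Abstract (reference translation): We present SWE Atlas, a benchmark suite for coding agents that spans three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from earlier SWE benchmarks in three key ways: it targets task categories that are underrepresented but important in practice, it uses comprehensive evaluation protocols specific to each category, and it adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness to evaluate software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance, while even the best open-weight models score poorly. Our analysis suggests that top models rely on extensive codebase exploration and runtime-driven reasoning. However, even top models consistently struggle with subtle edge cases, complex runtime analysis, and adherence to software engineering best practices. Overall, SWE Atlas provides a complementary evaluation suite for measuring both correctness and engineering quality in coding agents.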

Related papers