Fugu-MT Paper Translation (Abstract): Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

Paper overview: Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

arxiv url: http://arxiv.org/abs/2510.08996v1
Date: Fri, 10 Oct 2025 04:42:02 GMT
Status: Translation complete
System update date: 2025-10-14 00:38:48.112429
Title: Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
Title (reference translation): SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation
Authors: Spandan Garg, Ben Steenhoek, Yufan Huang,
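Abstract summary: Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues. The paper introduces a novel benchmarking framework that transforms existing benchmarks into realistic user queries.
Reference score (independently calculated attention score): 3.2097144717773287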
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, and transform formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities for some models by >50% over baseline performance for public benchmarks and ~10-16% for our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
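Abstract (reference translation): Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are derived mainly from GitHub issues and do not accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We hypothesize that this mismatch leads to a systematic overestimation of agent capabilities in real-world scenarios, especially bug fixing. We propose a new benchmarking framework that converts existing benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can easily be extended to existing benchmarks. In this paper, we apply the testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and the private benchmark SWE-Bench C#, converting formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. The results show that existing benchmarks overestimate agent capabilities for some models by more than 50% over baseline performance on the public benchmarks and by roughly 10-16% on the internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.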

Related papers