Fugu-MT Paper Translation (Abstract): SWE Context Bench: A Benchmark for Context Learning in Coding

Paper overview: SWE Context Bench: A Benchmark for Context Learning in Coding

arxiv url: http://arxiv.org/abs/2602.08316v1
Date: Mon, 09 Feb 2026 06:44:45 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:25.09107
Title: SWE Context Bench: A Benchmark for Context Learning in Coding
Authors: Jared Zhu, Minhao Hu, Junde Wu
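Abstract summary: SWE-ContextBench is a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, it augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests. The results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost.
Reference score (independently computed attention score): 6.093520696434546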
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience reuse settings, including oracle-guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.

Related papers