Fugu-MT paper translation (abstract): Scaling Test-Time Compute for Agentic Coding

Paper overview: Scaling Test-Time Compute for Agentic Coding

arxiv url: http://arxiv.org/abs/2604.16529v1
Date: Thu, 16 Apr 2026 17:39:33 GMT
Status: translation complete
Last updated in system: 2026-04-21 21:52:52.054245
Title: Scaling Test-Time Compute for Agentic Coding
Authors: Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal
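Abstract summary: This paper proposes a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. The framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes. The method consistently improves the performance of frontier coding agents on SWE-Bench Verified and Terminal-Bench v2.0.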
Reference score (attention metric computed independently by this site): 126.72747643609274
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
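The abstract describes Recursive Tournament Voting (RTV) only at a high level: a population of rollout summaries is narrowed recursively through small-group comparisons. The following is a minimal sketch of that idea, not the authors' implementation. The function name `recursive_tournament_vote`, the `Judge` callable, the default `group_size` of 4, the per-round shuffle, and the toy judge are all assumptions made for illustration; in a real system the judge would presumably be a model call that compares a handful of summaries.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface: a judge receives a small group of rollout summaries
# and returns the index of the one it prefers (e.g. via an LLM comparison).
Judge = Callable[[Sequence[str]], int]


def recursive_tournament_vote(
    summaries: Sequence[str],
    judge: Judge,
    group_size: int = 4,  # assumed value; the abstract only says "small-group"
    seed: int = 0,
) -> str:
    """Narrow a population of rollout summaries down to a single winner."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    rng = random.Random(seed)
    population: List[str] = list(summaries)

    # Each round splits the population into small groups and keeps one
    # winner per group, so the population shrinks until one summary remains.
    while len(population) > 1:
        rng.shuffle(population)  # assumed: regroup randomly each round
        survivors: List[str] = []
        for start in range(0, len(population), group_size):
            group = population[start:start + group_size]
            # A leftover group of one advances without a comparison.
            winner = group[0] if len(group) == 1 else group[judge(group)]
            survivors.append(winner)
        population = survivors

    return population[0]


def toy_judge(group: Sequence[str]) -> int:
    # Stand-in for a model-based judge: prefer the summary that reports
    # the most passing tests (parsed from the text after "=").
    return max(range(len(group)), key=lambda i: int(group[i].split("=")[1]))


# Hypothetical summaries; real ones would describe hypotheses, progress,
# and failure modes of each rollout.
summaries = [f"rollout {i}: tests_passed={i % 7}" for i in range(16)]
best = recursive_tournament_vote(summaries, toy_judge)
print(best)
```

Because each comparison sees only `group_size` summaries at a time, the judge never has to hold the whole population in context, which is the practical appeal of working over compact summaries rather than full trajectories. With 16 summaries and groups of 4, this sketch needs two rounds (16 to 4 to 1).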
