Fugu-MT Paper Translation (Abstract): SWE-RM: Execution-free Feedback For Software Engineering Agents

Paper overview: SWE-RM: Execution-free Feedback For Software Engineering Agents

arxiv url: http://arxiv.org/abs/2512.21919v1
Date: Fri, 26 Dec 2025 08:26:18 GMT
Status: Translation complete
System update date: 2025-12-29 20:48:42.058074
Title: SWE-RM: Execution-free Feedback For Software Engineering Agents
Title (reference translation): SWE-RM: Execution-free Feedback for Software Engineering Agents
Authors: KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He
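Abstract summary: Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. SWE-RM is an accurate and robust reward model that adopts a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
Reference score (independently computed attention score): 61.86380395896069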
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. Aiming to develop versatile reward models that are effective across TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
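Abstract (reference translation): Execution-based feedback such as unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across both TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0% and of Qwen3-Coder-Max from 67.0% to 74.6%.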
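
Illustrative sketch (not from the paper): the abstract distinguishes three properties of a verifier: best-trajectory selection for TTS, classification accuracy, and calibration. The toy Python below shows how two hypothetical verifiers can agree on the selected trajectory while differing on the other two. The function names, the 0.5 decision threshold, the toy scores, and the use of a 10-bin expected calibration error (one common calibration measure) are all assumptions made for illustration; the abstract does not state which measures or procedures the authors use.

```python
from typing import List, Sequence


def best_of_n(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring trajectory (TTS-style selection)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Fraction of trajectories whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def expected_calibration_error(
    scores: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean gap between average score and empirical success rate per bin."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    total = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_score = sum(s for s, _ in b) / len(b)
        success_rate = sum(y for _, y in b) / len(b)
        ece += (len(b) / total) * abs(mean_score - success_rate)
    return ece


# Toy data (assumed): four candidate trajectories, only the second resolves the issue.
labels = [0, 1, 0, 0]
verifier_a = [0.10, 0.90, 0.20, 0.05]  # well-separated scores
verifier_b = [0.55, 0.60, 0.58, 0.52]  # same ranking winner, compressed scores

if __name__ == "__main__":
    for name, scores in (("A", verifier_a), ("B", verifier_b)):
        print(
            name,
            "selected:", best_of_n(scores),
            "accuracy:", accuracy(scores, labels),
            "ECE:", round(expected_calibration_error(scores, labels), 4),
        )
```

In this toy setting both verifiers select the same trajectory, so a best-of-n evaluation cannot tell them apart, yet verifier B labels every failed trajectory as a success at the 0.5 threshold and has a larger calibration gap. This is only a small-scale analogue of the observation reported in the abstract.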

Related papers