Fugu-MT Paper Translation (Abstract): Scaling Laws for Agent Harnesses via Effective Feedback Compute

Paper overview: Scaling Laws for Agent Harnesses via Effective Feedback Compute

arxiv url: http://arxiv.org/abs/2605.29682v1
Date: Thu, 28 May 2026 09:45:47 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.154628
Title: Scaling Laws for Agent Harnesses via Effective Feedback Compute
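Title (reference translation): Scaling Laws for Agent Harnesses via Effective Feedback Compute
Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Abstract summary: Effective Feedback Compute (EFC) is a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions. EFC-based coordinates consistently predict failure rates better than raw-compute baselines.
Reference score (independently computed attention score): 53.68149869349268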
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
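Abstract (reference translation): Agent harnesses increasingly determine the performance of language-model systems, because they decide how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Current test-time scaling analyses, however, often parameterize this process by raw expenditure such as tokens, tool calls, operations, wall time, or cost, which cannot separate useful feedback from redundant or unstable interaction. Effective Feedback Compute (EFC) is a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions; it is normalized by task demand when tasks with different feedback requirements are compared. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, Oracle-EFC and Estimated-EFC reach $0.94$, and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises the success rate from $0.27$ to $0.90$ while raw cost and tool calls are held fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently the raw budget is converted into durable, task-sufficient feedback.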
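The abstract describes EFC only in words, so the following is a minimal sketch of the idea rather than the authors' definition. It assumes a hypothetical trace schema (`FeedbackEvent`), treats each of the four criteria as a binary gate, detects redundancy by a hypothetical `content_key`, and treats task demand ($D_{\mathrm{task}}$) as a given positive number. The paper may well use weighted or estimated credits instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One feedback event in an agent trace (hypothetical schema, not from the paper)."""
    content_key: str   # assumed identifier used to detect repeated feedback
    informative: bool  # does the feedback carry new task-relevant signal?
    valid: bool        # is the feedback correct / trustworthy?
    retained: bool     # is it still available to later decisions?


def effective_feedback_count(trace):
    """Count events that are informative, valid, non-redundant, and retained.

    Assumption: each criterion is a hard yes/no gate, and an event is
    redundant if an already-credited event has the same content_key.
    """
    credited_keys = set()
    credited = 0
    for event in trace:
        passes_gates = event.informative and event.valid and event.retained
        if passes_gates and event.content_key not in credited_keys:
            credited_keys.add(event.content_key)
            credited += 1
    return credited


def normalized_efc(trace, task_demand):
    """Divide credited feedback by an assumed task-demand value (D_task)."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback_count(trace) / task_demand


# Illustrative trace: five raw feedback events, invented for this example.
trace = [
    FeedbackEvent("test_failure:case_3", True, True, True),
    FeedbackEvent("test_failure:case_3", True, True, True),   # repeat of the first
    FeedbackEvent("lint_warning", False, True, True),         # not informative
    FeedbackEvent("stack_trace:parse", True, True, False),    # not retained
    FeedbackEvent("test_failure:case_7", True, True, True),
]

raw_events = len(trace)
credited_events = effective_feedback_count(trace)
efc_per_demand = normalized_efc(trace, task_demand=4)
```

In this toy trace the raw count is five events but only two earn credit, which illustrates why a raw-expenditure coordinate and an EFC-style coordinate can diverge for the same budget.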

Related papers