Fugu-MT Paper Translation (Abstract): Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

arxiv url: http://arxiv.org/abs/2604.11465v2
Date: Wed, 15 Apr 2026 13:28:59 GMT
Status: Translation complete
Last updated in system: 2026-04-16 13:09:57.441977
Title: Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents
Authors: S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, Antonios Saravanos
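Abstract summary: We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. We evaluate Qwen3-8B on the AppWorld benchmark under both full-precision and 4-bit quantized configurations. On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation.
Reference score (independently computed attention score): 0.4666493857924357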
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24GB GPU, we evaluate Qwen3-8B on the AppWorld benchmark under both full-precision and 4-bit quantized configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. Applied to the same unmodified model, this scaffolding yields 8.9% (FP16) and 5.9% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8% to 26.3% FP16; 5.3% to 14.0% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4 times their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.
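The three-role pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ScaffoldedAgent` class, the `Model` callable type, the prompt wording, and the `max_history_chars` compression trigger are all hypothetical. The one property it takes from the abstract is that a single frozen model is invoked three times with different conditioning, and that the correction role never sees the conversation history.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: the frozen model is any function from prompt to text.
Model = Callable[[str], str]


@dataclass
class ScaffoldedAgent:
    model: Model                      # same frozen weights for all three roles
    max_history_chars: int = 2000     # assumed threshold for compressing history
    history: List[str] = field(default_factory=list)

    def _summarize(self) -> str:
        """Role 1: compress dialogue history, keeping critical artifacts."""
        raw = "\n".join(self.history)
        if len(raw) <= self.max_history_chars:
            return raw  # short enough, no model call needed
        prompt = (
            "ROLE: summarizer\n"
            "Compress the dialogue below. Keep tokens, credentials, "
            "and API responses verbatim.\n\n" + raw
        )
        return self.model(prompt)

    def _act(self, task: str, context: str) -> str:
        """Role 2: main agent reasons over the compressed context."""
        prompt = (
            f"ROLE: agent\nTask: {task}\nContext:\n{context}\n"
            "Write the next code action."
        )
        return self.model(prompt)

    def _correct(self, code: str) -> str:
        """Role 3: isolated reviewer; deliberately receives no history."""
        prompt = f"ROLE: corrector\nReview and revise this code:\n{code}"
        return self.model(prompt)

    def step(self, task: str, observation: str) -> str:
        """Run one environment step through all three roles."""
        self.history.append(observation)
        context = self._summarize()
        draft = self._act(task, context)
        final = self._correct(draft)
        self.history.append(final)
        return final
```

Passing a stub function as `model` is enough to exercise the control flow; in a real setup the callable would wrap whatever inference server hosts the frozen model.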
