Fugu-MT Paper Translation (Abstract): AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

Paper summary: AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?

arxiv url: http://arxiv.org/abs/2605.00334v1
Date: Fri, 01 May 2026 01:25:56 GMT
Status: Translation complete
System update date: 2026-05-04 17:43:28.811763
Title: AgentFloor: How Far Up the tool use Ladder Can Small Open-Weight Models Go?
Title (reference translation): AgentFloor: How Usable Are Small Open-Weight Models?
Authors: Ranit Karmakar, Jayita Chatterjee
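Abstract summary: AgentFloor is a deterministic 30-task benchmark organized as a six-tier capability ladder. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. The results show that small and mid-sized open-weight models are sufficient for much of the short-horizon, structured tool use work.
Reference score (independently computed attention score): 0.0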
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate, the strongest open-weight model matches GPT-5 on our benchmark while being substantially cheaper and faster to run. The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage, though neither side reaches strong reliability. We also find that this boundary is not explained by scale alone: some failures respond to targeted interventions, but the effects are model-specific rather than universal. These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that truly demand deeper planning and control. We release the benchmark, harness, sweep configurations, and full run corpus.
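Abstract (reference translation): Production agentic systems make many model calls per user request. Which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? AgentFloor is a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines. The gap appears most clearly on long-horizon planning tasks that require sustained coordination and reliable constraint tracking over many steps, where frontier models still hold an advantage. Some failures respond to targeted interventions, but the effects are model-specific rather than universal. These findings suggest a practical design principle for agentic systems: use smaller open-weight models for the broad base of routine actions, and reserve large frontier models for the narrower class of tasks that demand deeper planning and control. We release the benchmark, harness, sweep configurations, and full run corpus.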
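
The design principle stated in the abstract (smaller open-weight models for routine calls, frontier models for long-horizon planning) could be sketched as a simple router. This is only an illustrative sketch, not the paper's harness: the task-kind labels, the `ModelCall` type, the `route` function, and the `max_small_steps` threshold are all hypothetical, since the abstract does not publish tier names or a routing rule.

```python
from dataclasses import dataclass

# Hypothetical labels based on the capability areas named in the abstract.
# The actual six tier names and any cutoffs are NOT given in the abstract.
ROUTINE_KINDS = {"instruction_following", "tool_use", "multi_step_coordination"}
DEMANDING_KINDS = {"long_horizon_planning"}


@dataclass(frozen=True)
class ModelCall:
    task_kind: str        # one of the hypothetical labels above
    expected_steps: int   # rough estimate of how many steps the call spans


def route(call: ModelCall, max_small_steps: int = 8) -> str:
    """Pick a model class for one call (threshold is an assumed placeholder)."""
    # Long-horizon work, or anything spanning many steps, goes to the frontier model.
    if call.task_kind in DEMANDING_KINDS or call.expected_steps > max_small_steps:
        return "frontier"
    # Short, structured, routine calls go to the small open-weight model.
    if call.task_kind in ROUTINE_KINDS:
        return "small_open_weight"
    # Unknown task kinds fall back to the frontier model as the cautious default.
    return "frontier"
```

For example, `route(ModelCall("tool_use", 2))` selects the small open-weight model, while `route(ModelCall("long_horizon_planning", 2))` selects the frontier model. In practice the threshold would need to be calibrated against per-tier results such as those in the released run corpus.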
