Fugu-MT Paper Translation (Abstract): SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Paper overview: SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

arxiv url: http://arxiv.org/abs/2606.07074v1
Date: Fri, 05 Jun 2026 09:10:50 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.661179
Title: SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
Title (reference translation): SlimSearcher: Training Efficiency-Aware Web Agents with Adaptive Reward Gating
Authors: Zequn Xie, Junjie Wang, Dan Yang, Jie Feng, Yue Shen, Jian Wang, Jinjie Gu
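Abstract summary: Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, but this power comes at a steep computational cost. SlimSearcher is a framework that pushes the frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In experiments on long-horizon benchmarks such as GAIA, BrowseComp, and XBenchDeepSearch, SlimSearcher reduces average tool-call rounds by 17% to 58% while maintaining or improving accuracy.
Reference score (independently calculated attention score): 26.487281765184083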
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies characterized by blind tool dependency and performative reasoning, generating long, redundant trajectories that are far from necessary for resolving these tasks, leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher employs Pareto-efficient filtration to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behaviors. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. By cascading these adaptive efficiency metrics with a strict correctness gate, our approach effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks, including GAIA, BrowseComp, and XBenchDeepSearch, demonstrate that SlimSearcher reduces average tool-call rounds by 17%-58% while maintaining or improving accuracy.
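Abstract (reference translation): Deep research agents have demonstrated remarkable capabilities on complex information-seeking tasks, but this power comes at a steep computational cost. Driven by accuracy-focused training paradigms, current models adopt brute-force strategies marked by blind tool dependency and performative reasoning, producing long, redundant trajectories well beyond what these tasks require and leading to wasteful tool calls and excessive token consumption. To overcome this efficiency trap, we propose SlimSearcher, a principled framework that pushes the Pareto frontier between accuracy and computational cost across both Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). In the SFT stage, SlimSearcher uses Pareto-efficient filtering to distill trajectories that are both successful and economical, guiding the model toward inherently efficiency-aware search behavior. During RL, we introduce Adaptive Reward Gating, a dynamic reward-shaping mechanism that evaluates relative tool and token efficiency within a sampled cohort. Cascading these adaptive efficiency metrics with a strict correctness gate effectively avoids the brevity bias associated with absolute penalties and mitigates reward hacking. Extensive experiments on long-horizon benchmarks such as GAIA, BrowseComp, and XBenchDeepSearch show that SlimSearcher reduces average tool-call rounds by 17% to 58% while maintaining or improving accuracy.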
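
Illustrative sketch (not from the paper): the abstract names two mechanisms, Pareto-efficient filtration of SFT trajectories and a correctness-gated reward that scores efficiency relative to a sampled cohort, but gives no formulas. The Python below is one possible reading under stated assumptions. The `Trajectory` fields, the helper names, the `bonus` weight, and the use of the cohort mean over correct rollouts as the baseline are all hypothetical choices made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Trajectory:
    # Hypothetical record of one rollout; field names are assumptions.
    correct: bool    # did the final answer pass the correctness check?
    tool_calls: int  # number of tool-call rounds used
    tokens: int      # total tokens generated


def pareto_filter(trajs):
    """Keep correct trajectories that no other correct one dominates on cost."""
    ok = [t for t in trajs if t.correct]

    def dominates(a, b):
        # a is no worse than b on both costs and strictly better on one
        return (a.tool_calls <= b.tool_calls and a.tokens <= b.tokens
                and (a.tool_calls < b.tool_calls or a.tokens < b.tokens))

    return [t for t in ok if not any(dominates(o, t) for o in ok)]


def relative_saving(value, baseline):
    """Fraction saved versus the baseline, clipped at zero (no penalty)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - value) / baseline)


def gated_rewards(cohort, bonus=0.2):
    """Assumed reward: 0 if wrong, else 1 plus a cohort-relative bonus."""
    ok = [t for t in cohort if t.correct]
    if not ok:
        return [0.0] * len(cohort)
    avg_calls = mean(t.tool_calls for t in ok)
    avg_tokens = mean(t.tokens for t in ok)
    rewards = []
    for t in cohort:
        if not t.correct:
            rewards.append(0.0)  # gate closed: efficiency earns nothing
            continue
        gain = 0.5 * (relative_saving(t.tool_calls, avg_calls)
                      + relative_saving(t.tokens, avg_tokens))
        rewards.append(1.0 + bonus * gain)
    return rewards


cohort = [
    Trajectory(True, 4, 1000),
    Trajectory(True, 8, 3000),
    Trajectory(False, 1, 100),
    Trajectory(True, 6, 800),
]
kept = pareto_filter(cohort)
rewards = gated_rewards(cohort)
```

In this sketch the short but incorrect rollout receives zero reward, so brevity alone is never rewarded, and a correct rollout that is costlier than the cohort baseline keeps its base reward of 1.0 instead of being penalized. Both behaviors are design choices of the sketch, intended to mirror the abstract's description of a strict correctness gate and of relative rather than absolute efficiency scoring.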
