Fugu-MT Paper Translation (Abstract): DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Paper overview: DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

arxiv url: http://arxiv.org/abs/2603.11076v1
Date: Tue, 10 Mar 2026 20:54:23 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.499768
Title: DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
Authors: Aili Chen, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao
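Abstract summary: Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68.
Reference score (independently computed attention score): 66.02634251098537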
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks. Scaling diversity is difficult because training requires tasks to remain executable and verifiable, while generalization demands coverage of diverse tool types, toolset combinations, and heterogeneous tool-use patterns. We propose DIVE, an evidence-driven recipe that inverts synthesis order, executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces, thereby providing grounding by construction. DIVE scales structural diversity along two controllable axes, tool-pool coverage and per-task toolset variety, and an Evidence Collection–Task Derivation loop further induces rich multi-step tool-use patterns across 373 tools in five domains. Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68. Remarkably, controlled scaling analysis reveals that diversity scaling consistently outperforms quantity scaling for OOD generalization, even with 4x less data.


Related papers