Fugu-MT Paper Translation (Abstract): Revisiting DAgger in the Era of LLM-Agents

Paper overview: Revisiting DAgger in the Era of LLM-Agents

arxiv url: http://arxiv.org/abs/2605.12913v1
Date: Wed, 13 May 2026 02:40:28 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.766697
Title: Revisiting DAgger in the Era of LLM-Agents
Title (reference translation): Rethinking DAgger in the Era of LLM-Agents
Authors: Changhao Li, Rushi Qiang, Jiawei Huang, Chenxiao Gao, Chao Zhang, Niao He, Bo Dai
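Abstract summary: Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it trains on off-policy teacher trajectories, while reinforcement learning with verifiable rewards avoids this off-policy mismatch but receives only sparse outcome feedback. The paper addresses this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents.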
Reference score (independently computed attention score): 35.615579397673166
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories; while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but with only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By directly interacting with environments, we expose the model to realistic states likely to be encountered during deployment, thereby effectively mitigating covariate shift. Besides, since the student is learned by mimicking the teacher's behavior, it receives rich feedback during learning. To demonstrate DAgger enjoys the benefits of both worlds, we tested the algorithm to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.
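Abstract (reference translation): Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it is trained on off-policy teacher trajectories, whereas reinforcement learning with verifiable rewards avoids this mismatch by learning from on-policy rollouts but receives only sparse outcome feedback. The paper addresses this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents. The algorithm collects trajectories through a turn-level interpolation of the student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. Interacting directly with environments exposes the model to realistic states it is likely to encounter during deployment, which mitigates covariate shift. In addition, because the student learns by imitating the teacher's behavior, it receives rich feedback during learning. To show that DAgger combines the benefits of both approaches, the authors use it to train a software-engineering agent with 4B- and 8B-scale student models. On SWE-bench Verified, DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent reaches 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest that DAgger is effective for modern long-horizon LM agents.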
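The data-collection loop described in the abstract can be illustrated with a minimal sketch on a toy problem. This is not the paper's implementation: the 1-D environment, the `teacher_policy` rule, the `TabularStudent` class, and the mixing schedule `beta = 0.5 ** i` are all hypothetical choices made for illustration. Only the overall shape follows the abstract: at each turn either the teacher or the student acts, every visited state is labeled by the teacher, and the student is fit on the aggregated dataset.

```python
import random
from typing import Dict, List, Tuple

State = int
Action = int


def teacher_policy(state: State) -> Action:
    """Hypothetical expert: step toward the goal state 0."""
    if state > 0:
        return -1
    if state < 0:
        return 1
    return 0


class TabularStudent:
    """Toy student that memorizes one action per state (stand-in for an LM)."""

    def __init__(self) -> None:
        self.table: Dict[State, Action] = {}

    def act(self, state: State, rng: random.Random) -> Action:
        # Unseen states get a random action, mimicking an untrained policy.
        return self.table.get(state, rng.choice([-1, 0, 1]))

    def fit(self, dataset: List[Tuple[State, Action]]) -> None:
        # "Supervised training": store the teacher label for each state.
        for state, label in dataset:
            self.table[state] = label


def collect_rollout(student: TabularStudent, beta: float, rng: random.Random,
                    horizon: int = 10) -> List[Tuple[State, Action]]:
    """Roll out a turn-level mixture of teacher and student policies."""
    state = rng.randint(-5, 5)
    pairs: List[Tuple[State, Action]] = []
    for _ in range(horizon):
        label = teacher_policy(state)      # teacher labels every visited state
        pairs.append((state, label))
        # With probability beta the teacher acts this turn, else the student.
        action = label if rng.random() < beta else student.act(state, rng)
        state = max(-8, min(8, state + action))  # keep the toy state bounded
    return pairs


def dagger(iterations: int = 5, rollouts_per_iter: int = 20, seed: int = 0):
    rng = random.Random(seed)
    student = TabularStudent()
    dataset: List[Tuple[State, Action]] = []
    for i in range(iterations):
        beta = 0.5 ** i                    # assumed schedule: 1.0, 0.5, 0.25, ...
        for _ in range(rollouts_per_iter):
            dataset.extend(collect_rollout(student, beta, rng))
        student.fit(dataset)               # retrain on the aggregated data
    return student, dataset


if __name__ == "__main__":
    student, dataset = dagger()
    print(len(dataset), "labeled (state, action) pairs collected")
```

As `beta` decays, more turns are driven by the student, so the dataset increasingly covers states the student itself reaches, while the labels always come from the teacher.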


Related papers