Fugu-MT paper translation (abstract): What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

Paper overview: What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

arxiv url: http://arxiv.org/abs/2605.19447v1
Date: Tue, 19 May 2026 07:00:55 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:09.175513
Title: What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
Authors: Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen
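Abstract summary: Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains difficult. Existing methods rely on trajectory-level rewards or proxy signals without fully exploiting per-step environmental feedback. The paper introduces SERL, a selective environment-reweighted learning framework.
Reference score (independently computed attention score): 70.6980022118038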
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.
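The abstract states only that the task reward sets the update direction while environment feedback adjusts where and how strongly credit is placed; it gives no formula. The following is a minimal sketch of that idea, not the authors' method. The function name `selective_step_weights`, the `baseline` threshold, and the assumption that feedback arrives as one non-negative relevance score per step are all hypothetical.

```python
from typing import List


def selective_step_weights(
    task_reward: float,
    feedback_scores: List[float],
    baseline: float = 0.5,  # hypothetical success threshold, not from the paper
) -> List[float]:
    """Return one signed credit weight per step of a trajectory.

    Assumption: the sign comes only from the trajectory-level task reward,
    and the magnitude comes only from per-step environment feedback scores.
    """
    # Direction: reinforce the trajectory on success, penalize it on failure.
    direction = 1.0 if task_reward > baseline else -1.0

    # Clamp negative scores so feedback can never flip the update direction.
    scores = [max(0.0, s) for s in feedback_scores]
    total = sum(scores)

    if total <= 0.0:
        # No informative feedback: fall back to uniform credit over all steps.
        n = len(scores)
        return [direction / n for _ in scores]

    # Placement and magnitude: steps with stronger feedback get more credit.
    return [direction * s / total for s in scores]


if __name__ == "__main__":
    # A successful three-step episode where the second step drew the most feedback.
    print(selective_step_weights(1.0, [0.0, 3.0, 1.0]))
```

In this sketch the weights would multiply per-step log-probability gradients in a policy-gradient update, so steps with no feedback receive no credit and the total absolute credit per trajectory is normalized to 1.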


Related papers