Fugu-MT Paper Translation (Abstract): Co-Evolving Skill Generation and Policy Optimization

Paper summary: Co-Evolving Skill Generation and Policy Optimization

arxiv url: http://arxiv.org/abs/2606.08755v1
Date: Sun, 07 Jun 2026 17:55:55 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.431165
Title: Co-Evolving Skill Generation and Policy Optimization
Title (reference translation): Co-Evolving Skill Generation and Policy Optimization
Authors: Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma
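Abstract summary: Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. This paper proposes an online reinforcement learning framework for pre-storage skill validation.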
Reference score (independently computed attention measure): 35.41582114275514
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
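Abstract (reference translation): Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge obtained from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely evaluate whether a newly generated skill is useful before it is stored and reused. Even skills produced by proprietary frontier LLMs show highly mixed utility, with many offering little benefit or even degrading performance. Once such skills enter the bank, their effects are hard to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, which lets the framework promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and pruning of outdated skills as the policy evolves.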
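
The matched-group comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `reward_gap` and `validate_candidate`, the even split of the rollout budget, the zero acceptance threshold, and the `rollout` and `induce_skill` callables are all assumptions introduced here for illustration. The abstract states only that base and skill-augmented rollouts share the standard budget and that their reward gap estimates the candidate skill's marginal utility.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
#   rollout(task, skills) -> (trajectory, reward)
#   induce_skill(base_trajectories) -> candidate skill text
Rollout = Callable[[str, Sequence[str]], Tuple[str, float]]
Inducer = Callable[[Sequence[str]], str]


def reward_gap(base_rewards: Sequence[float],
               augmented_rewards: Sequence[float]) -> float:
    """Mean reward of the skill-augmented group minus that of the base group."""
    return mean(augmented_rewards) - mean(base_rewards)


def validate_candidate(task: str,
                       retrieved: Sequence[str],
                       rollout: Rollout,
                       induce_skill: Inducer,
                       budget: int = 8,
                       threshold: float = 0.0) -> Tuple[str, float, bool]:
    """Estimate a candidate skill's marginal utility before storing it.

    The budget is split evenly between the two groups (an assumption),
    so no rollouts are spent beyond the given budget.
    """
    if budget < 2:
        raise ValueError("budget must allow at least one rollout per group")

    # Group 1: base rollouts conditioned on the currently retrieved skills.
    n_base = budget // 2
    base = [rollout(task, retrieved) for _ in range(n_base)]

    # One candidate skill is induced from the base trajectories.
    candidate = induce_skill([trajectory for trajectory, _ in base])

    # Group 2: same task and retrieved skills, plus the candidate skill.
    augmented_skills: List[str] = list(retrieved) + [candidate]
    augmented = [rollout(task, augmented_skills)
                 for _ in range(budget - n_base)]

    gap = reward_gap([reward for _, reward in base],
                     [reward for _, reward in augmented])
    # Store the skill only if it helps beyond the retrieved skills.
    return candidate, gap, gap > threshold
```

Because both groups share the same task and retrieved skills, the gap reflects the candidate's contribution in that context rather than its standalone quality. In the framework as described, the same signal would also serve as a training reward for the policy as a skill generator, which this sketch does not cover.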


Related papers