Fugu-MT paper translation (abstract): Skill-R1: Agent Skill Evolution via Reinforcement Learning

Paper overview: Skill-R1: Agent Skill Evolution via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.09359v1
Date: Sun, 10 May 2026 06:19:15 GMT
Status: translation complete
Last updated in system: 2026-05-12 23:28:50.211342
Title: Skill-R1: Agent Skill Evolution via Reinforcement Learning
Authors: Yash Vishe, Rohan Surana, Xunyi Jiang, Zihan Huang, Xintong Li, Nikki Lijing Kuang, Tong Yu, Ryan A. Rossi, Jingbo Shang, Julian McAuley, Junda Wu
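Abstract summary: Skill-R1 is a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. It preserves black-box compatibility with both open- and closed-source models while making adaptation substantially cheaper than model-level updates.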
Reference score (site-computed attention score): 84.35984979949502
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic large language models often rely on skills, reusable natural language procedures that guide planning, action, and tool use. In practice, skills are typically improved through prompt engineering or by aligning the task LLM itself, which is costly, model-specific, and often infeasible for closed-source models. Skill optimization is not a one-step problem but a recurrent process with two coupled levels of credit assignment: a useful skill must improve rollout quality under current conditioning, while a useful revision must turn observed outcomes into a better skill for the next round. We propose Skill-R1, a reinforcement learning framework for instance-level recurrent skill optimization from verifiable rewards. Rather than updating the task LLM, Skill-R1 trains a lightweight skill generator that conditions on the task context, prior rollouts, and their verified outcomes to produce skills that steer a frozen task LLM. This preserves black-box compatibility with both open- and closed-source models while making adaptation substantially cheaper than model-level updates. Skill-R1 proceeds over multiple generations: at each step, the current skill induces rollouts whose verified outcomes are fed back to produce the next revision. To optimize this recurrent process, we introduce a bi-level group-relative policy optimization objective combining intra-generation and inter-generation advantages. The intra-generation term compares rollouts under shared skill conditioning, while the inter-generation term rewards revisions that improve behavior across successive generations. Together, these provide a principled objective for directional skill evolution rather than one-shot self-refinement. Empirically, Skill-R1 achieves consistent gains over no-skill baselines and standard GRPO across benchmarks with verifiable rewards, with particularly strong improvements on complex, multi-step tasks.
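The abstract names two advantage signals but gives no formulas, so the following is only a rough sketch of how they might be computed, not the paper's objective. It assumes that the intra-generation term is the usual GRPO-style standardization of rewards among rollouts sharing one skill, that the inter-generation term is the change in mean verified reward between successive generations, and that the two are summed with a hypothetical weight `lam`. All function names here are illustrative.

```python
from statistics import mean, pstdev


def intra_generation_advantages(rewards, eps=1e-8):
    """Standardize the rewards of rollouts that share one skill (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def inter_generation_advantages(generation_rewards):
    """ASSUMED form: change in mean verified reward versus the previous generation.

    The first generation has no predecessor, so its value is 0.0.
    """
    means = [mean(g) for g in generation_rewards]
    return [0.0] + [curr - prev for prev, curr in zip(means, means[1:])]


def bilevel_advantages(generation_rewards, lam=0.5):
    """ASSUMED combination: intra term plus lam times the inter term."""
    inter = inter_generation_advantages(generation_rewards)
    return [
        [a + lam * delta for a in intra_generation_advantages(g)]
        for g, delta in zip(generation_rewards, inter)
    ]


if __name__ == "__main__":
    # Two generations of four rollouts each, with binary verified rewards.
    rewards = [[0.0, 0.0, 1.0, 0.0], [1.0, 1.0, 0.0, 1.0]]
    for gen, adv in enumerate(bilevel_advantages(rewards)):
        print(gen, [round(a, 3) for a in adv])
```

In this sketch a revision that raises the mean reward shifts every advantage in its generation upward, which is one simple way to reward improvement across generations in addition to ranking rollouts within a generation.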

