Fugu-MT Paper Translation (Abstract): SEIF: Self-Evolving Reinforcement Learning for Instruction Following

Paper overview: SEIF: Self-Evolving Reinforcement Learning for Instruction Following

arxiv url: http://arxiv.org/abs/2605.07465v1
Date: Fri, 08 May 2026 09:13:12 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.940672
Title: SEIF: Self-Evolving Reinforcement Learning for Instruction Following
Title (reference translation): SEIF: Self-Evolving Reinforcement Learning for Instruction Following
Authors: Qingyu Ren, Qianyu He, Jiajie Zhu, Xingzhou Chen, Jingwen Chang, Zeye Sun, Han Xia, Fei Yu, Jiaqing Liang, Yanghua Xiao
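Abstract summary: We propose SEIF, a self-evolving framework for enhancing the instruction-following ability of large language models (LLMs). SEIF forms a closed self-evolution loop that improves the model's instruction-following ability. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality.
Reference score (independently calculated attention level): 53.280277743734096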
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
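Abstract (reference translation): Instruction following is a fundamental capability of large language models (LLMs), but continuously improving it remains difficult. Existing methods generally rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop in which the evolution of instruction difficulty and the evolution of model capability reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow the evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are trained alternately and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.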
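The four-role loop described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under loud assumptions, not the authors' implementation (which is at the repository linked above): every name here (`CONSTRAINT_POOL`, `instructor`, `instruction_filter`, `follower`, `judger`, `run_loop`) is hypothetical, the "models" are plain functions, and the reinforcement-learning updates are replaced by two integer counters (`skill` for the Follower, `difficulty` for the Instructor).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Hypothetical constraint vocabulary; one pair is mutually exclusive.
CONSTRAINT_POOL = ["all_lowercase", "no_digits", "all_uppercase",
                   "ends_with_period", "under_20_words"]
CONFLICTS = [frozenset({"all_lowercase", "all_uppercase"})]


@dataclass(frozen=True)
class Instruction:
    prompt: str
    constraints: Tuple[str, ...]


def instructor(difficulty: int, n: int = 5) -> List[Instruction]:
    """Toy Instructor: attach `difficulty` constraints to each of n prompts."""
    size = len(CONSTRAINT_POOL)
    return [
        Instruction(f"task-{i}",
                    tuple(CONSTRAINT_POOL[(i + k) % size] for k in range(difficulty)))
        for i in range(n)
    ]


def instruction_filter(instructions: List[Instruction]) -> List[Instruction]:
    """Toy Filter: drop instructions that are empty or self-contradictory."""
    def valid(ins: Instruction) -> bool:
        cs = set(ins.constraints)
        return bool(cs) and not any(pair <= cs for pair in CONFLICTS)
    return [ins for ins in instructions if valid(ins)]


def follower(ins: Instruction, skill: int) -> FrozenSet[str]:
    """Toy Follower: 'satisfies' only the first `skill` constraints."""
    return frozenset(ins.constraints[:skill])


def judger(ins: Instruction, satisfied: FrozenSet[str]) -> float:
    """Toy Judger: reward is the fraction of constraints satisfied."""
    return len(satisfied & set(ins.constraints)) / len(ins.constraints)


def run_loop(rounds: int = 4) -> List[Tuple[int, int, float]]:
    """Alternate Follower and Instructor updates; log (difficulty, skill, reward)."""
    difficulty, skill = 1, 1
    history = []
    for _ in range(rounds):
        batch = instruction_filter(instructor(difficulty))
        if not batch:  # every generated instruction was filtered out
            break
        rewards = [judger(ins, follower(ins, skill)) for ins in batch]
        mean_reward = sum(rewards) / len(rewards)
        history.append((difficulty, skill, round(mean_reward, 2)))
        if mean_reward < 1.0:
            skill += 1        # stand-in for an RL step on the Follower
        else:
            difficulty += 1   # stand-in for training the Instructor
    return history


if __name__ == "__main__":
    for difficulty, skill, reward in run_loop():
        print(f"difficulty={difficulty} skill={skill} mean_reward={reward}")
```

In this sketch the Instructor only becomes harder once the Follower has mastered the current difficulty, which is one simple way to make the two roles co-evolve; the actual alternation schedule and reward design in SEIF may differ.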


Related papers