Fugu-MT Paper Translation (Abstract): Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Paper abstract: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

arxiv url: http://arxiv.org/abs/2509.15194v1
Date: Thu, 18 Sep 2025 17:50:04 GMT
Status: Translation complete
System update date: 2025-09-19 17:26:53.371821
Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
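Title (reference translation): Evolving language models without labels: the majority drives selection, and novelty promotes variation
Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
Abstract summary: We propose EVOL-RL (EVolution-Oriented and Label-free Reinforcement Learning). EVOL-RL keeps the majority-voted answer as a stable anchor (selection) and adds a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation).
Reference score (independently computed attention measure): 74.75716642635484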
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods, confidence minimization, self-consistency, or majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.
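Abstract (reference translation): Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR). Existing label-free methods, such as confidence minimization, self-consistency, and majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), our goal is to enable general improvements without sacrificing the model's inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVOL-RL, a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) and adds a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. For example, training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we show that EVOL-RL also boosts performance in the RLVR setting.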
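
The selection-plus-variation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the reward formula, so the function names (`cosine`, `novelty_scores`, `selection_variation_rewards`), the novelty definition (one minus mean cosine similarity to the other responses in the group), the `novelty_weight` parameter, and the reward bands are all assumptions made for illustration. The embeddings are assumed to come from some sentence encoder applied to each response's reasoning.

```python
from collections import Counter
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def novelty_scores(embeddings):
    """Assumed novelty: 1 - mean cosine similarity to the other responses."""
    scores = []
    for i, emb in enumerate(embeddings):
        sims = [cosine(emb, other) for j, other in enumerate(embeddings) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 0.0
        scores.append(1.0 - mean_sim)
    return scores


def selection_variation_rewards(answers, embeddings, novelty_weight=0.5):
    """Majority vote fixes the sign of the reward; novelty shifts it within a band.

    Hypothetical scheme: majority answers get +1, others get -1, and each
    response gains novelty_weight * (min-max normalized novelty). With
    0 <= novelty_weight < 1 the novelty bonus can never flip the sign.
    """
    if not 0.0 <= novelty_weight < 1.0:
        raise ValueError("novelty_weight must be in [0, 1)")
    # Ties are broken by first occurrence (Counter preserves insertion order).
    majority_answer, _ = Counter(answers).most_common(1)[0]
    novelty = novelty_scores(embeddings)
    low, high = min(novelty), max(novelty)
    span = high - low
    rewards = []
    for answer, score in zip(answers, novelty):
        normalized = (score - low) / span if span > 0 else 0.0
        base = 1.0 if answer == majority_answer else -1.0
        rewards.append(base + novelty_weight * normalized)
    return rewards
```

For instance, calling `selection_variation_rewards(["42", "42", "42", "7"], [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])` gives the third response, which reaches the majority answer by a different route, a larger reward than the first two, while the minority answer stays negative. In a GRPO-style setup these rewards would then be normalized within the group to form advantages; the asymmetric clipping and entropy regularizer mentioned in the abstract are not covered by this sketch.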


Related papers