Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
- URL: http://arxiv.org/abs/2509.15194v2
- Date: Wed, 01 Oct 2025 05:29:42 GMT
- Title: Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
- Authors: Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu
- Abstract summary: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR). We propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. EVOL-RL consistently outperforms the majority-only baseline.
- Score: 74.75716642635484
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing self-improvement approaches primarily rely on self-confirmation signals (e.g., confidence, entropy, or consistency) to generate rewards. This reliance drives models toward over-confident, majority-favored solutions, causing an entropy collapse that degrades pass@n and reasoning complexity. To address this, we propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation. Concretely, EVOL-RL retains the majority-voted answer as an anchor for stability, but adds a novelty-aware reward that scores each sampled solution by how different its reasoning is from other concurrently generated responses. This majority-for-stability + novelty-for-exploration rule mirrors the variation-selection principle: selection prevents drift, while novelty prevents collapse. Evaluation results show that EVOL-RL consistently outperforms the majority-only baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from baseline's 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents in-domain diversity collapse but also improves out-of-domain generalization (from math reasoning to broader tasks, e.g., GPQA, MMLU-Pro, and BBEH). The code is available at: https://github.com/YujunZhou/EVOL-RL.
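The abstract's majority-for-stability + novelty-for-exploration rule can be sketched as a per-sample reward. The sketch below is a minimal, illustrative interpretation only: the function name, the ±1 selection reward, and the use of token-set Jaccard dissimilarity as a novelty proxy are assumptions (the paper scores how different each solution's *reasoning* is from the other concurrently sampled responses, typically in a semantic space, not via token overlap).

```python
from collections import Counter

def evol_rl_rewards(answers, reasonings, novelty_weight=0.5):
    """Sketch of EVOL-RL's selection + variation reward (hypothetical helper).

    answers:    final answers extracted from n concurrently sampled solutions
    reasonings: token sets for each solution's reasoning trace (a crude proxy;
                the paper measures novelty between full reasoning traces)
    """
    # Selection: the majority-voted answer serves as the stability anchor.
    majority_answer, _ = Counter(answers).most_common(1)[0]

    rewards = []
    for i, (ans, trace) in enumerate(zip(answers, reasonings)):
        # Variation: score how different this reasoning is from the other
        # sampled traces (here 1 - mean Jaccard similarity).
        others = [t for j, t in enumerate(reasonings) if j != i]
        sims = [len(trace & t) / len(trace | t) for t in others if trace | t]
        novelty = (1.0 - sum(sims) / len(sims)) if sims else 0.0

        base = 1.0 if ans == majority_answer else -1.0
        rewards.append(base + novelty_weight * novelty)
    return rewards
```

Under this toy scoring, a majority-answer solution with an unusual reasoning path is rewarded more than a duplicate one, which is the mechanism the abstract credits with preventing entropy collapse.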
Related papers
- Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting [44.347846669388446]
We propose EVOL-SAM3, a zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting.
arXiv Detail & Related papers (2025-12-31T08:10:03Z)
- RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization [52.01526898310723]
We introduce RESTRAIN, a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data.
arXiv Detail & Related papers (2025-10-02T16:24:01Z)
- The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward [58.559544190947584]
A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance. This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. We argue that standard RLVR objectives lack a crucial mechanism for knowledge retention.
arXiv Detail & Related papers (2025-09-09T06:34:32Z)
- The Majority is not always right: RL training for solution aggregation [53.1050856072799]
We train an aggregator model to review, reconcile, and synthesize a final, correct answer. A key ingredient is careful balancing of easy and hard training examples. We find our method, AggLM, outperforms both strong rule-based and reward-model baselines.
arXiv Detail & Related papers (2025-09-08T16:39:38Z)
- Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR [102.05010188302428]
We propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training. This strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR.
arXiv Detail & Related papers (2025-08-19T17:42:45Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. We show eva can create effective RL curricula and is robust across ablations.
arXiv Detail & Related papers (2024-10-31T08:15:32Z)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
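The closed-form reparameterization described above reduces RLHF to a simple supervised objective on preference pairs. As a minimal sketch (the argument names and the per-example scalar form are mine; real implementations batch this over token log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss: -log sigmoid of the implicit reward margin.

    logp_w / logp_l: summed token log-probs of the chosen / rejected response
    under the policy; ref_logp_*: the same quantities under the frozen
    reference model. beta scales the implicit reward.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), computed as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy matches the reference, the margin is zero and the loss sits at log 2; raising the chosen response's likelihood relative to the reference drives the loss down, with no explicit reward model or RL loop in sight.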
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
- Supplementing Gradient-Based Reinforcement Learning with Simple Evolutionary Ideas [4.873362301533824]
We present a simple, sample-efficient algorithm for introducing large but directed learning steps in reinforcement learning (RL).
The methodology uses a population of RL agents training with a common experience buffer, with occasional crossovers and mutations of the agents in order to search efficiently through the policy space.
arXiv Detail & Related papers (2023-05-10T09:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.