RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
- URL: http://arxiv.org/abs/2510.02172v1
- Date: Thu, 02 Oct 2025 16:24:01 GMT
- Title: RESTRAIN: From Spurious Votes to Signals -- Self-Driven RL with Self-Penalization
- Authors: Zhaoning Yu, Will Su, Leitian Tao, Haozhu Wang, Aashu Singh, Hanchao Yu, Jianyu Wang, Hongyang Gao, Weizhe Yuan, Jason Weston, Ping Yu, Jing Xu,
- Abstract summary: We introduce RESTRAIN, a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal.
Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution.
On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data.
- Score: 52.01526898310723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at high costs in labeled data while faltering on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce RESTRAIN (REinforcement learning with Self-restraint), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model's entire answer distribution: penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains. The self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it improves Pass@1 by up to +140.7 percent on AIME25, +36.2 percent on MMLU_STEM, and +19.6 percent on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results demonstrate that RESTRAIN establishes a scalable path toward stronger reasoning without gold labels.
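The abstract describes the mechanism only at a high level. Below is a minimal Python sketch of how a self-penalizing pseudo-reward could be wired into a GRPO-style group update; the vote-share reward, the consistency threshold, the fixed penalty value, and the helper names (`self_penalized_rewards`, `grpo_advantages`) are illustrative assumptions based on the abstract, not RESTRAIN's published formulation.

```python
import numpy as np
from collections import Counter

def self_penalized_rewards(answers, min_consistency=0.3):
    """Soft pseudo-rewards for one group of rollouts on an unlabeled prompt.

    Illustrative sketch only: the vote-share reward, the consistency gate, and
    the -1.0 penalty are assumptions, not RESTRAIN's published formulas.
    """
    n = len(answers)
    vote_share = {a: c / n for a, c in Counter(answers).items()}
    consistency = max(vote_share.values())  # how peaked the answer distribution is

    if consistency < min_consistency:
        # Low-consistency example: no answer is trustworthy, so penalize the
        # whole group instead of overcommitting to a spurious majority vote.
        return np.full(n, -1.0)

    # Reward each rollout by the vote share of its final answer, so the signal
    # reflects the entire answer distribution rather than a hard majority label.
    return np.array([vote_share[a] for a in answers])

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Example: 8 sampled final answers for one unlabeled prompt.
answers = ["42", "42", "42", "17", "42", "42", "9", "42"]
print(grpo_advantages(self_penalized_rewards(answers)))
```

In this sketch, a peaked answer distribution rewards each rollout in proportion to the support its answer receives, while a flat distribution triggers a uniform penalty that washes out to near-zero advantages under group normalization, effectively dropping the low-consistency example from the update.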
Related papers
- TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning [33.47825979936341]
Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs).
We propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones.
With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%).
arXiv Detail & Related papers (2025-12-15T09:03:45Z)
- Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models [31.422773877490613]
Reasoning LLMs (RLMs) deliver strong multi-step reasoning through chain-of-thought generation.
RLMs' large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings.
We introduce RESP, a structured pruning framework that aligns pruning decisions with the model's reasoning dynamics.
arXiv Detail & Related papers (2025-12-01T20:27:05Z)
- Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values [53.72318444646282]
We propose Reinforcement Learning with Explicit Human Values (RLEV).
RLEV aligns Large Language Model (LLM) optimization directly with quantifiable human value signals.
We show RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales.
arXiv Detail & Related papers (2025-10-23T04:15:22Z)
- Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation [74.75716642635484]
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR).
We propose EVOL-RL, a label-free framework that mirrors the evolutionary principle of balancing selection with variation.
EVOL-RL consistently outperforms the majority-only baseline.
arXiv Detail & Related papers (2025-09-18T17:50:04Z)
- Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement [29.474742920809565]
Reinforcement learning with verifiable rewards (RLVR) shows promise in improving the reasoning ability of large language models (LLMs).
We propose Co-Reward, a novel RL framework that leverages contrastive agreement across semantically analogical questions as a reward basis.
arXiv Detail & Related papers (2025-08-01T08:09:14Z)
- One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.
We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z)
- The Achilles Heel of AI: Fundamentals of Risk-Aware Training Data for High-Consequence Models [0.0]
AI systems in high-consequence domains must detect rare, high-impact events while operating under tight resource constraints.
Traditional annotation strategies that prioritize label volume over informational value introduce redundancy and noise.
This paper introduces smart-sizing, a training data strategy that emphasizes label diversity, model-guided selection, and marginal utility-based stopping.
arXiv Detail & Related papers (2025-05-20T22:57:35Z)
- Reward Modeling with Weak Supervision for Language Models [12.599789817157188]
This work introduces weak supervision as a strategy to extend RLHF datasets and enhance reward model performance.
By analyzing RLHF datasets to identify imprecise responses, we wrote simple labeling functions and then calibrated a label model to weakly annotate unlabeled data.
Our evaluation shows that while weak supervision significantly benefits smaller datasets by improving reward model performance, its effectiveness decreases with larger, originally labeled datasets.
arXiv Detail & Related papers (2024-10-28T09:37:58Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- How to Leverage Unlabeled Data in Offline Reinforcement Learning [125.72601809192365]
Offline reinforcement learning (RL) can learn control policies from static datasets but, like standard RL methods, it requires reward annotations for every transition.
One natural solution is to learn a reward function from the labeled data and use it to label the unlabeled data.
We find that, perhaps surprisingly, a much simpler method that simply applies zero rewards to unlabeled data leads to effective data sharing.
arXiv Detail & Related papers (2022-02-03T18:04:54Z)