Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
- URL: http://arxiv.org/abs/2510.14420v1
- Date: Thu, 16 Oct 2025 08:24:44 GMT
- Title: Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
- Authors: Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu,
- Abstract summary: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications.<n>We propose a label-free self-supervised reinforcement learning framework that eliminates dependency on external supervision.<n>Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges.
- Score: 58.60470643433354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
Related papers
- Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following [37.69688837528397]
Reasoning models excel in complex problem solving but exhibit a concerning trade off between reasoning capabilities and instruction following abilities.<n>We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following capabilities without external supervision.
arXiv Detail & Related papers (2025-08-04T07:48:59Z) - Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision.<n>We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data.<n>We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z) - SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Models (LLMs) training with limited initial data.<n>The proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z) - Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks.<n>Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data.<n>In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z) - URLB: Unsupervised Reinforcement Learning Benchmark [82.36060735454647]
We introduce the Unsupervised Reinforcement Learning Benchmark (URLB)
URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards.
We provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods.
arXiv Detail & Related papers (2021-10-28T15:07:01Z) - Robust Disentanglement of a Few Factors at a Time [5.156484100374058]
We introduce population-based training (PBT) for improving consistency in training variational autoencoders (VAEs)
We then use Unsupervised Disentanglement Ranking (UDR) as an unsupervised to score models in our PBT-VAE training and show how models trained this way tend to consistently disentangle only a subset of the generative factors.
We show striking improvement in state-of-the-art unsupervised disentanglement performance and robustness across multiple datasets and metrics.
arXiv Detail & Related papers (2020-10-26T12:34:23Z) - Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks namely Self-Supervised Q-learning(SQN) and Self-Supervised Actor-Critic(SAC)
arXiv Detail & Related papers (2020-06-10T11:18:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.