POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
- URL: http://arxiv.org/abs/2601.18779v1
- Date: Mon, 26 Jan 2026 18:47:21 GMT
- Title: POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
- Authors: Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, Aviral Kumar
- Abstract summary: On hard problems, on-policy reinforcement learning (RL) rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems. POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts.
- Score: 78.9858758758376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has improved the reasoning abilities of large language models (LLMs), yet state-of-the-art methods still fail to learn on many training problems. On hard problems, on-policy RL rarely explores even a single correct rollout, yielding zero reward and no learning signal for driving improvement. We find that natural solutions to remedy this exploration problem from classical RL, such as entropy bonuses, more permissive clipping of the importance ratio, or direct optimization of pass@k objectives, do not resolve this issue and often destabilize optimization without improving solvability. A natural alternative is to leverage transfer from easier problems. However, we show that mixing easy and hard problems during RL training is counterproductive due to ray interference, where optimization focuses on already-solvable problems in a way that actively inhibits progress on harder ones. To address this challenge, we introduce Privileged On-Policy Exploration (POPE), an approach that leverages human- or other oracle solutions as privileged information to guide exploration on hard problems, unlike methods that use oracle solutions as training targets (e.g., off-policy RL methods or warmstarting from SFT). POPE augments hard problems with prefixes of oracle solutions, enabling RL to obtain non-zero rewards during guided rollouts. Crucially, the resulting behaviors transfer back to the original, unguided problems through a synergy between instruction-following and reasoning. Empirically, POPE expands the set of solvable problems and substantially improves performance on challenging reasoning benchmarks.
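The abstract's core mechanism (augmenting hard problems with oracle-solution prefixes so on-policy rollouts can earn non-zero reward) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `policy_sample`, `verify`, and the prefix fractions are all illustrative assumptions.

```python
import random

def privileged_rollouts(problem, oracle_solution, policy_sample, verify,
                        n_rollouts=8, prefix_fractions=(0.0, 0.25, 0.5, 0.75)):
    """Sketch of prefix-guided exploration: prepend a slice of the oracle
    solution, let the policy complete it on-policy, and score the result."""
    rollouts = []
    for _ in range(n_rollouts):
        # Shorter prefixes make the problem harder; longer ones make it easier.
        frac = random.choice(prefix_fractions)
        cut = int(len(oracle_solution) * frac)
        prefix = oracle_solution[:cut]
        completion = policy_sample(problem, prefix)  # on-policy continuation
        reward = 1.0 if verify(problem, prefix + completion) else 0.0
        rollouts.append((prefix, completion, reward))
    return rollouts
```

With a verifier that checks the final answer, guided rollouts that reach a correct solution receive reward 1.0, giving RL a gradient signal even on problems the unguided policy never solves.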
Related papers
- Learn Hard Problems During RL with Reference Guided Fine-tuning [56.56461712665904]
Reinforcement learning (RL) for mathematical reasoning can suffer from reward sparsity. We introduce Reference-Guided Fine-Tuning (ReGFT) to synthesize positive trajectories on hard problems and train on them before RL. Our results show that ReGFT effectively overcomes reward sparsity and unlocks stronger RL-based mathematical reasoning.
arXiv Detail & Related papers (2026-03-01T18:41:28Z)
- Reuse your FLOPs: Scaling RL on Hard Problems by Conditioning on Very Off-Policy Prefixes [22.721425502443253]
We introduce PrefixRL, where we condition on the prefix of successful off-policy traces and run on-policy RL to complete them. PrefixRL boosts the learning signal on hard problems by modulating the difficulty of the problem through the off-policy prefix length. We prove that the PrefixRL objective is not only consistent with the standard RL objective but also more efficient.
arXiv Detail & Related papers (2026-01-26T18:57:00Z)
- Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs [126.45104018441698]
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs). We argue that RL's failure to produce diverse solutions stems from regularizing local token behavior rather than diversity over sets of solutions. We propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies.
arXiv Detail & Related papers (2026-01-13T17:48:43Z)
- Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z)
- Decomposing Elements of Problem Solving: What "Math" Does RL Teach? [22.517954679764244]
We decompose problem solving into fundamental capabilities: Plan, Execute, and Verify. We show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers.
arXiv Detail & Related papers (2025-05-28T18:18:49Z)
- Solving Bayesian inverse problems with diffusion priors and off-policy RL [86.65351676007721]
Relative Trajectory Balance (RTB) is an off-policy reinforcement learning objective that can asymptotically solve inverse problems optimally. We extend the original work by using RTB to train conditional diffusion-model posteriors from pretrained unconditional priors for challenging linear and non-linear inverse problems in vision and science.
arXiv Detail & Related papers (2025-03-12T18:45:22Z)
- Decoupled Prioritized Resampling for Offline RL [114.73666323173204]
We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning. OPER features a class of priority functions designed to prioritize highly rewarding transitions, making them more frequently visited during training. We show that this class of priority functions induces an improved behavior policy, and that, when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
arXiv Detail & Related papers (2023-06-08T17:56:46Z)
- Query-Policy Misalignment in Preference-Based Reinforcement Learning [21.212703100030478]
We show that the seemingly informative queries selected to improve the overall quality of the reward model may not actually align with RL agents' interests.
We show that this issue can be effectively addressed via near on-policy query and a specially designed hybrid experience replay.
Our method achieves substantial gains in both human feedback and RL sample efficiency.
arXiv Detail & Related papers (2023-05-27T07:55:17Z)
- Learning Vehicle Routing Problems using Policy Optimisation [4.093722933440819]
State-of-the-art approaches learn a policy using reinforcement learning, and the learnt policy acts as a pseudo solver.
These approaches have demonstrated good performance in some cases, but given the large search space typical of routing problems, they can converge too quickly to a poor policy.
We propose entropy regularised reinforcement learning (ERRL) that supports exploration by providing more policies.
arXiv Detail & Related papers (2020-12-24T14:18:56Z)
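As a rough illustration of the entropy-regularisation idea mentioned above (this is a generic sketch, not the ERRL objective from the paper), an entropy bonus can be added to a policy-gradient loss so that prematurely deterministic policies are penalised and exploration is sustained:

```python
import math

def entropy_regularised_loss(logprobs, advantages, probs, beta=0.01):
    """Generic sketch: policy-gradient loss minus a weighted entropy bonus.
    All argument names are illustrative."""
    # Policy-gradient term: reinforce actions with positive advantage.
    pg = -sum(lp * a for lp, a in zip(logprobs, advantages)) / len(logprobs)
    # Entropy of the current action distribution; subtracting it from the
    # loss rewards keeping the policy stochastic.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return pg - beta * entropy
```

Raising `beta` strengthens the exploration pressure, at the cost of slower convergence toward a deterministic policy.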
This list is automatically generated from the titles and abstracts of the papers in this site.