Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
- URL: http://arxiv.org/abs/2506.13923v2
- Date: Fri, 20 Jun 2025 00:51:15 GMT
- Title: Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models
- Authors: Vaskar Nath, Elaine Lau, Anisha Gunjal, Manasi Sharma, Nikhil Baharte, Sean Hendryx
- Abstract summary: We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$.
- Score: 3.207886496235499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B parameters on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive $\text{Guide}$ -- a new class of online training algorithms. $\text{Guide}$ adaptively incorporates hints into the model's context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the "off-policy" trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of $\text{Guide}$ for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4$\%$ macro-average improvement across math benchmarks. We include careful ablations to analyze $\text{Guide}$'s components and theoretically analyze $\text{Guide}$'s learning efficiency.
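To make the described training loop concrete, the sketch below illustrates the two mechanisms from the abstract: a hint is injected only when every rollout for a prompt fails verification, and the importance ratio is computed against the hint-free prompt so that hinted (off-policy) trajectories still optimize the unhinted policy. The `sample_fn`, `verify_fn`, and `policy.log_prob` interfaces are assumptions of this sketch, not the authors' implementation.

```python
import torch

def guide_grpo_loss(policy, prompt, hint, sample_fn, verify_fn,
                    num_rollouts=8, clip_eps=0.2):
    """One Guide-GRPO update for a single prompt (illustrative sketch).

    Assumed interfaces (not the authors' API):
      policy.log_prob(context, completion) -> differentiable sequence log-prob
      sample_fn(context, n)                -> list of n sampled completions
      verify_fn(completion)                -> 1.0 if the final answer verifies
    """
    context = prompt
    rollouts = sample_fn(context, num_rollouts)
    rewards = torch.tensor([verify_fn(r) for r in rollouts], dtype=torch.float32)

    if rewards.sum() == 0:  # every rollout failed -> adaptively inject a hint
        context = prompt + "\nHint: " + hint
        rollouts = sample_fn(context, num_rollouts)
        rewards = torch.tensor([verify_fn(r) for r in rollouts], dtype=torch.float32)

    # GRPO-style group-normalized advantages
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    losses = []
    for rollout, a in zip(rollouts, adv):
        # Score each trajectory in the *hint-free* context: hinted rollouts
        # become off-policy data, and the importance ratio below corrects
        # for having sampled them from the hinted context.
        logp_new = policy.log_prob(prompt, rollout)
        logp_old = policy.log_prob(context, rollout).detach()
        ratio = torch.exp(logp_new - logp_old)
        losses.append(-torch.min(
            ratio * a, torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * a))
    return torch.stack(losses).mean()
```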
Related papers
- Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z)
- Training Superior Sparse Autoencoders for Instruct Models [16.3663776969074]
We propose a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT}(P)$ and $\textit{BT}(F)$.
arXiv Detail & Related papers (2025-06-09T12:23:34Z)
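For orientation, a generic sparse autoencoder of the kind $\textit{FAST}$ trains is sketched below; the architecture and `l1_coeff` are standard assumptions, since the paper's contribution lies in how instruct-model activations are collected rather than in the SAE itself.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic SAE over residual-stream activations (illustrative sketch).

    The paper's contribution is in how training activations are collected
    from instruct-formatted data; the SAE itself is standard.
    """
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.enc(acts))  # sparse feature activations
        return self.dec(features), features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # reconstruction error plus L1 sparsity penalty (l1_coeff is an assumed value)
    return ((recon - acts) ** 2).sum(-1).mean() + l1_coeff * features.abs().sum(-1).mean()
```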
- Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening [36.81125165911328]
Reinforcement learning is emerging as a primary driver for improving language model reasoning capabilities. We investigate whether current reinforcement learning algorithms merely sharpen the base model's distribution around problems it can already solve. We show that an unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem-proving settings.
arXiv Detail & Related papers (2025-06-03T01:15:15Z)
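A minimal sketch of an unlikeliness-style reward: among verified-correct rollouts, those the policy currently ranks as least likely receive the largest bonus, pushing mass toward solutions that plain GRPO would under-reinforce. The rank-based bonus and `beta` scale are assumptions, not the paper's exact formulation.

```python
import torch

def unlikeliness_rewards(correct, logprobs, beta=0.5):
    """Rank-based 'unlikeliness' bonus (illustrative, not the paper's exact form).

    correct:  0/1 float tensor marking verified-correct rollouts
    logprobs: sequence log-probabilities of the same rollouts
    beta:     assumed bonus scale
    """
    n = len(logprobs)
    ranks = torch.argsort(torch.argsort(logprobs))   # rank 0 = least likely
    bonus = (n - 1 - ranks).float() / max(n - 1, 1)  # 1.0 for least likely
    return correct * (1.0 + beta * bonus)            # only correct rollouts gain
```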
- Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws [52.10468229008941]
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting. We provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. Building on these insights, we introduce a novel method for Contrastive Language-Image Pretraining with a reference model, termed DRRho-CLIP.
arXiv Detail & Related papers (2025-05-10T16:55:03Z)
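One common instantiation of learning with a reference model is an RHO-style "learnability" filter, sketched below; DRRho-CLIP's actual objective is more involved, so treat this purely as an illustration of reference-guided data selection.

```python
import torch

def reference_selected_indices(loss_target, loss_ref, top_frac=0.5):
    """RHO-style learnability selection with a reference model (sketch).

    Keep the examples whose current training loss most exceeds the reference
    model's loss: they are learnable but not yet learned. loss_target and
    loss_ref are per-example loss tensors; top_frac is an assumed budget.
    """
    scores = loss_target - loss_ref
    k = max(1, int(top_frac * scores.numel()))
    return torch.topk(scores, k).indices  # indices of the most informative examples
```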
- Entropy-Based Adaptive Weighting for Self-Training [15.089334734753677]
We propose Entropy-Based Adaptive Weighting for Self-Training (EAST), an adaptive weighting strategy designed to prioritize uncertain data during self-training. We evaluate our approach on the GSM8K and MATH benchmarks.
arXiv Detail & Related papers (2025-03-31T10:04:35Z)
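An entropy-based weight of the kind EAST describes can be sketched as follows; the exact entropy-to-weight mapping (`1 + alpha * entropy`) is an assumption for illustration.

```python
import math
from collections import Counter

def east_style_weight(sampled_answers, alpha=1.0):
    """Entropy-based weight for one prompt (illustrative sketch).

    sampled_answers: final answers extracted from several sampled solutions.
    A wide spread of answers (high entropy = high model uncertainty) yields
    a larger weight during self-training.
    """
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 + alpha * entropy
```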
- S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
- T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling [52.34735382627312]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. Existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. We present T1, which scales reinforcement learning by encouraging exploration, and use it to understand inference scaling.
arXiv Detail & Related papers (2025-01-20T18:33:33Z)
- Learning Goal-Conditioned Representations for Language Reward Models [10.94845204766088]
We propose training reward models (RMs) in a contrastive, $\textit{goal-conditioned}$ fashion.
We show this way of training RM representations enables improved $\textit{steerability}$ because it allows us to evaluate the likelihood of an action achieving a particular goal-state.
We additionally find that these representations can perform fine-grained control by conditioning on desired future goal-states.
arXiv Detail & Related papers (2024-07-18T20:23:11Z)
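A goal-conditioned contrastive objective of this flavor can be sketched with InfoNCE, pairing each intermediate-state representation with the goal state of its own successful trajectory; the InfoNCE form and temperature are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def goal_contrastive_loss(state_reps, goal_reps, temperature=0.1):
    """InfoNCE over (intermediate state, goal state) pairs (sketch).

    Row i of state_reps is a representation partway through a successful
    trajectory; row i of goal_reps is that trajectory's goal state. Other
    rows in the batch act as negatives.
    """
    sims = F.normalize(state_reps, dim=-1) @ F.normalize(goal_reps, dim=-1).T
    labels = torch.arange(state_reps.size(0), device=state_reps.device)
    return F.cross_entropy(sims / temperature, labels)
```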
- An Empirical Study of $\mu$P Learning Rate Transfer [0.0]
We show that the $\mu$-Transfer method can yield near-optimal learning rates in practice. Despite its evident promise, the $\mu$P method is not yet widely adopted.
arXiv Detail & Related papers (2024-04-08T17:59:44Z)
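The practical content of $\mu$-Transfer is a width-dependent learning-rate rule. Below is a simplified sketch in which every matrix-like parameter is treated as "hidden" and gets its Adam learning rate scaled by base_width / width; real $\mu$P also treats embedding and output layers specially.

```python
import torch

def mup_param_groups(model, base_lr, base_width, width):
    """Simplified muP learning-rate rule (sketch).

    Matrix-like ('hidden') weights get their Adam learning rate scaled by
    base_width / width, so a rate tuned at base_width transfers to wider
    models; vector-like parameters keep the base rate.
    """
    hidden = [p for p in model.parameters() if p.ndim >= 2]
    other = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": hidden, "lr": base_lr * base_width / width},  # width-scaled
        {"params": other, "lr": base_lr},                        # unscaled
    ]

# usage: optimizer = torch.optim.Adam(mup_param_groups(model, 3e-4, 256, 1024))
```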
- Scalable Learning of Item Response Theory Models [48.91265296134559]
Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data.
We leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets.
arXiv Detail & Related papers (2024-03-01T17:12:53Z)
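For reference, the standard two-parameter-logistic IRT model has the same functional form as logistic regression in $a_j(\theta_i - b_j)$, which is what a coreset approximation can exploit:

```python
import math

def irt_2pl_prob(theta, a, b):
    """Two-parameter-logistic IRT item response curve.

    Probability that an examinee of ability theta answers correctly an item
    with discrimination a and difficulty b. The linear score a * (theta - b)
    is exactly a logistic-regression logit.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# e.g. irt_2pl_prob(theta=1.5, a=2.0, b=-0.5) ~= 0.98
```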
- A Relational Intervention Approach for Unsupervised Dynamics Generalization in Model-Based Reinforcement Learning [113.75991721607174]
We introduce an interventional prediction module to estimate the probability of two estimates $\hat{z}_i, \hat{z}_j$ belonging to the same environment.
We empirically show that the $\hat{Z}$ estimated by our method carries less redundant information than in previous methods.
arXiv Detail & Related papers (2022-06-09T15:01:36Z)
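The interventional prediction module can be pictured as a small pairwise classifier over the estimated environment codes; the architecture below is an assumption for illustration.

```python
import torch
import torch.nn as nn

class SameEnvPredictor(nn.Module):
    """Pairwise classifier over estimated environment codes (sketch).

    Maps a pair (z_i, z_j) to the probability that both trajectories come
    from the same environment; the architecture is an assumption.
    """
    def __init__(self, z_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, z_i, z_j):
        return torch.sigmoid(self.net(torch.cat([z_i, z_j], dim=-1)))
```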
- Model-Augmented Q-learning [112.86795579978802]
We propose a model-free RL (MFRL) framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution identical to the solution obtained by learning with the true reward.
arXiv Detail & Related papers (2021-02-07T17:56:50Z)
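The shared estimator described above can be sketched as one torso with three heads; layer sizes and activations here are assumptions.

```python
import torch
import torch.nn as nn

class MQLNetwork(nn.Module):
    """Shared estimator for Q-value, transition, and reward (sketch).

    One torso consumes (state, action); three heads predict Q(s, a), the
    next state, and the reward.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, 1)
        self.next_state_head = nn.Linear(hidden, state_dim)
        self.reward_head = nn.Linear(hidden, 1)

    def forward(self, state, action):
        h = self.torso(torch.cat([state, action], dim=-1))
        return self.q_head(h), self.next_state_head(h), self.reward_head(h)
```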
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)