Self-Hinting Language Models Enhance Reinforcement Learning
- URL: http://arxiv.org/abs/2602.03143v1
- Date: Tue, 03 Feb 2026 05:56:20 GMT
- Title: Self-Hinting Language Models Enhance Reinforcement Learning
- Authors: Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian,
- Abstract summary: We propose self-hint aligned GRPO with privileged supervision (SAGE). SAGE injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO.
- Score: 37.311361929798714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group Relative Policy Optimization (GRPO) has recently emerged as a practical recipe for aligning large language models with verifiable objectives. However, under sparse terminal rewards, GRPO often stalls because rollouts within a group frequently receive identical rewards, causing relative advantages to collapse and updates to vanish. We propose self-hint aligned GRPO with privileged supervision (SAGE), an on-policy reinforcement learning framework that injects privileged hints during training to reshape the rollout distribution under the same terminal verifier reward. For each prompt $x$, the model samples a compact hint $h$ (e.g., a plan or decomposition) and then generates a solution $τ$ conditioned on $(x,h)$. Crucially, the task reward $R(x,τ)$ is unchanged; hints only increase within-group outcome diversity under finite sampling, preventing GRPO advantages from collapsing under sparse rewards. At test time, we set $h=\varnothing$ and deploy the no-hint policy without any privileged information. Moreover, sampling diverse self-hints serves as an adaptive curriculum that tracks the learner's bottlenecks more effectively than fixed hints from an initial policy or a stronger external model. Experiments over 6 benchmarks with 3 LLMs show that SAGE consistently outperforms GRPO, on average +2.0 on Llama-3.2-3B-Instruct, +1.2 on Qwen2.5-7B-Instruct and +1.3 on Qwen3-4B-Instruct. The code is available at https://github.com/BaohaoLiao/SAGE.
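The collapse described in the abstract is visible in the advantage computation itself. The sketch below (a minimal illustration, not the paper's implementation) computes group-relative advantages the way GRPO-style methods do: when every rollout in a group receives the same sparse terminal reward, every advantage is zero and the update vanishes, whereas hint-conditioned rollouts that diversify outcomes under the same verifier reward restore a non-zero signal.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: reward minus the group mean, scaled by the
    group standard deviation (guarded against division by zero)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Sparse terminal reward, all rollouts fail: every advantage is zero,
# so the policy-gradient update vanishes.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hint-conditioned rollouts that diversify outcomes under the *same*
# verifier reward restore a non-zero learning signal.
diverse = group_relative_advantages([0.0, 1.0, 0.0, 1.0])
```

Note that the hints change only which rewards land in the group, never the reward function $R(x,\tau)$ itself, which is why the no-hint policy can be deployed unchanged at test time.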
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - RC-GRPO: Reward-Conditioned Group Relative Policy Optimization for Multi-Turn Tool Calling Agents [40.88916135445381]
Multi-turn tool calling is challenging for Large Language Models because rewards are sparse and exploration is expensive. A common recipe, SFT followed by GRPO, can stall when within-group reward variation is low. We propose RC-GRPO, which treats exploration as a controllable steering problem via discrete reward tokens.
arXiv Detail & Related papers (2026-02-03T02:47:32Z) - SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models [67.41779761651924]
SOUP is a framework that unifies off- and on-policy learning within individual samples at the token level. It consistently outperforms standard on-policy training and existing off-policy extensions.
arXiv Detail & Related papers (2026-01-29T09:56:15Z) - $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences [22.199479724764725]
We introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO. These gains come without any modifications to the training data or additional computational cost.
arXiv Detail & Related papers (2025-10-08T10:39:07Z) - GRPO-$λ$: Credit Assignment improves LLM Reasoning [35.452488047246646]
We present GRPO-$\lambda$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We compare GRPO-$\lambda$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. With GRPO-$\lambda$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by more than $3$ points, with a $4.5$-point improvement on the 7B model.
arXiv Detail & Related papers (2025-09-30T19:11:10Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
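In miniature, this objective can be illustrated over a discrete set of candidate responses. The sketch below is a toy (the softmax here stands in for FlowRL's learnable partition function, and the candidate rewards are invented): scalar rewards define a normalized target distribution, and training would minimize the reverse KL from the policy to that target.

```python
import math

def softmax(xs):
    """Normalize raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(policy, target, eps=1e-12):
    """KL(policy || target): the divergence a FlowRL-style loss minimizes."""
    return sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(policy, target))

rewards = [2.0, 0.5, -1.0]           # scalar rewards for 3 candidate responses
target = softmax(rewards)            # reward-shaped target distribution
policy = softmax([0.0, 0.0, 0.0])    # uniform initial policy
loss = reverse_kl(policy, target)    # positive until the policy matches the target
```

The loss reaches zero exactly when the policy equals the target, i.e. when the policy's probability mass mirrors the full reward distribution rather than concentrating on the single highest-reward response.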
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning [55.15106182268834]
Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. It faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. We introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts.
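A minimal sketch of the down-sampling idea (illustrative field names; PODS's actual selection rule may differ in detail): keep the extreme-reward rollouts from each group, so the retained subset preserves a strong relative-advantage signal while shrinking the expensive update batch.

```python
def downsample_rollouts(rollouts, k):
    """Keep a high-variance subset of k rollouts: roughly half the
    lowest-reward and half the highest-reward trajectories, discarding
    the middle of the reward distribution."""
    ranked = sorted(rollouts, key=lambda r: r["reward"])
    lo = ranked[: k // 2]
    hi = ranked[-(k - k // 2):]
    return lo + hi

# Generate many cheap rollouts, then update on only k = 4 of them.
rollouts = [{"id": i, "reward": r}
            for i, r in enumerate([0.0, 1.0, 0.2, 0.9, 0.1, 0.8])]
subset = downsample_rollouts(rollouts, 4)
```

Because group-relative advantages depend on reward spread rather than absolute values, dropping middling rollouts trims the memory-intensive update while keeping most of the learning signal.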
arXiv Detail & Related papers (2025-04-18T17:49:55Z) - Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [10.617854230082896]
Group Relative Policy Optimization was recently introduced for promoting reasoning in LLMs under verifiable (binary) rewards. We analyze variants that differ in reward normalization (mean-only vs. mean + variance) and in how they regularize updates using KL divergence.
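The two normalization variants analyzed there differ only in whether the mean-centered rewards are additionally rescaled by the group standard deviation, e.g. (a minimal sketch):

```python
import statistics

def group_advantages(rewards, use_std=False, eps=1e-8):
    """Mean-only normalization subtracts the group mean; mean + variance
    normalization additionally divides by the group standard deviation."""
    mu = statistics.fmean(rewards)
    centered = [r - mu for r in rewards]
    if not use_std:
        return centered
    sigma = statistics.pstdev(rewards)
    return [c / (sigma + eps) for c in centered]

mean_only = group_advantages([1.0, 0.0, 1.0, 0.0])           # [0.5, -0.5, 0.5, -0.5]
standardized = group_advantages([1.0, 0.0, 1.0, 0.0], True)  # ≈ [1, -1, 1, -1]
```

Under binary rewards the mean-only variant scales advantages with the group success rate, while the standardized variant pins them near ±1; which behavior is preferable is exactly the kind of question such an analysis addresses.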
arXiv Detail & Related papers (2025-03-09T14:36:45Z) - Nearly Minimax Optimal Reward-free Reinforcement Learning [88.75843804630772]
We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions.
We give a new efficient algorithm, Staged Sampling + Truncated Planning (SSTP), which interacts with the environment for at most $O\!\left(\frac{S^2 A}{\epsilon^2}\,\mathrm{polylog}\!\left(\frac{SAH}{\epsilon^2}\right)\right)$ episodes.
arXiv Detail & Related papers (2020-10-12T17:51:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.