Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning
- URL: http://arxiv.org/abs/2602.03190v2
- Date: Thu, 05 Feb 2026 16:51:08 GMT
- Title: Prompt Augmentation Scales up GRPO Training on Mathematical Reasoning
- Authors: Wenquan Lu, Hai Huang, Randall Balestriero
- Abstract summary: We introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset. A Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance.
- Score: 19.22530791401551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning algorithms such as group-relative policy optimization (GRPO) have demonstrated strong potential for improving the mathematical reasoning capabilities of large language models. However, prior work has consistently observed an entropy collapse phenomenon during reinforcement post-training, characterized by a monotonic decrease in policy entropy that ultimately leads to training instability and collapse. As a result, most existing approaches restrict training to short horizons (typically 5-20 epochs), limiting sustained exploration and hindering further policy improvement. In addition, nearly all prior work relies on a single, fixed reasoning prompt or template during training. In this work, we introduce prompt augmentation, a training strategy that instructs the model to generate reasoning traces under diverse templates and formats, thereby increasing rollout diversity. We show that, without a KL regularization term, prompt augmentation enables stable scaling of training duration under a fixed dataset and allows the model to tolerate low-entropy regimes without premature collapse. Empirically, a Qwen2.5-Math-1.5B model trained with prompt augmentation on the MATH Level 3-5 dataset achieves state-of-the-art performance, reaching 45.2 per-benchmark accuracy and 51.8 per-question accuracy on standard mathematical reasoning benchmarks, including AIME24, AMC, MATH500, Minerva, and OlympiadBench. The code and model checkpoints are available at https://github.com/wenquanlu/prompt-augmentation-GRPO.
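As a rough illustration of the idea in the abstract, the sketch below draws a different prompt template for each rollout in a GRPO group and then standardizes rewards within the group. Everything here is an assumption made for illustration: the template strings, the per-rollout sampling, the use of the population standard deviation, and the names `TEMPLATES`, `augment_prompts`, and `group_relative_advantages` are hypothetical and not taken from the paper or its repository.

```python
import random
from statistics import mean, pstdev

# Hypothetical template pool; the paper's actual templates are not given in the abstract.
TEMPLATES = [
    "Solve the problem step by step and put the final answer in \\boxed{{}}.\n\n{question}",
    "{question}\n\nThink carefully, then state the final answer after 'Answer:'.",
    "Question: {question}\nWrite your reasoning inside <think> tags, then give the answer.",
]


def augment_prompts(question, group_size, rng):
    """Build one prompt per rollout, each using a randomly drawn template.

    Assumption: templates are sampled independently per rollout.
    """
    return [rng.choice(TEMPLATES).format(question=question) for _ in range(group_size)]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group.

    Assumption: population standard deviation, with eps guarding a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    for prompt in augment_prompts("What is 2 + 3?", group_size=4, rng=rng):
        print(repr(prompt))
    # Binary correctness rewards for the four rollouts (made-up values).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch, rollouts for the same question are conditioned on different instructions, which is one plausible way to increase rollout diversity; no KL term appears because the abstract reports training without one.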
Related papers
- MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration [48.446476072756276]
Training instability remains a critical challenge in large language model pretraining. We study training failures in a 5M NanoGPT model scaled via $\mu$P. We propose MSign, a new norm that periodically applies matrix sign operations to restore stable rank.
arXiv Detail & Related papers (2026-02-02T07:18:45Z)
- DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning [31.369103012768964]
DISPO is a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses. We show that DISPO achieves 61.04% on AIME'24 (vs. 55.42% for CISPO and 50.21% for DAPO), with similar gains across various benchmarks and models.
arXiv Detail & Related papers (2026-02-01T02:45:04Z)
- JustRL: Scaling a 1.5B LLM with a Simple RL Recipe [45.42398283391072]
Single-stage training achieves state-of-the-art performance on two 1.5B reasoning models. Training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions.
arXiv Detail & Related papers (2025-12-18T15:21:25Z)
- Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning [49.290631188365786]
Scaf-GRPO is a training framework that intervenes when a model's independent learning has plateaued. It boosts the pass@1 score of the Qwen2.5-Math-7B model by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach.
arXiv Detail & Related papers (2025-10-22T17:41:30Z)
- Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z)
- ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models [62.82372407840088]
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reshaped Token-level policy gradients (ResT) for tool-use tasks. ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%.
arXiv Detail & Related papers (2025-09-26T03:38:27Z)
- Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning [33.899779762210976]
Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem. Existing methods mitigate this issue with KL penalties or clipping, which passively updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap before training.
arXiv Detail & Related papers (2025-09-18T17:02:30Z)
- Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling [35.97270347306353]
We propose EGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO). The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient. On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B-parameter model trained with EGPO sets a new state-of-the-art among models of comparable size.
arXiv Detail & Related papers (2025-08-07T07:51:38Z)
- On-Policy RL with Optimal Reward Baseline [109.47676554514193]
On-Policy RL with Optimal reward baseline (OPO) is a novel and simplified reinforcement learning algorithm. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Results demonstrate OPO's superior performance and training stability without additional models or regularization terms.
arXiv Detail & Related papers (2025-05-29T15:58:04Z)
- Discriminative Policy Optimization for Token-Level Reward Models [55.98642069903191]
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs). Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. Reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH.
arXiv Detail & Related papers (2025-05-29T11:40:34Z)
- Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space [92.6187727249868]
We introduce LatentSeek, a framework that enhances reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024. Results show that LatentSeek consistently outperforms strong baselines.
arXiv Detail & Related papers (2025-05-19T16:26:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.