Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
- URL: http://arxiv.org/abs/2505.14403v3
- Date: Mon, 14 Jul 2025 11:21:20 GMT
- Title: Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
- Authors: Zhaohui Yang, Yuxiao Ye, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang
- Abstract summary: We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
- Score: 47.058298511243386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet most existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
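To make the third stage more concrete, below is a minimal sketch of a token-weighted offline policy-gradient loss in the spirit of the abstract: for negative samples, tokens inside steps that the LLM+PRM consensus judged correct are penalized less than the rest, instead of all tokens receiving an equal penalty. The function name, the `nsa_weight` and `kl_coef` parameters, the squared-difference behavior constraint, and the tensor layout are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def bcpg_nsa_loss_sketch(
    logprobs: torch.Tensor,           # (T,) per-token log-probs of the sampled response under the current policy
    ref_logprobs: torch.Tensor,       # (T,) per-token log-probs under the behavior (data-generating) policy
    reward: float,                    # +1 for a correct response, -1 for an incorrect one
    step_correct_mask: torch.Tensor,  # (T,) 1.0 for tokens in steps the consensus judger marked correct
    nsa_weight: float = 0.5,          # illustrative: how strongly "good" steps in negative samples are down-penalized
    kl_coef: float = 0.1,             # illustrative: strength of the behavior constraint
) -> torch.Tensor:
    """Toy token-weighted offline policy-gradient loss, loosely modeled on BCPG-NSA."""
    if reward > 0:
        # Positive samples: reinforce every token uniformly.
        token_weight = torch.ones_like(logprobs)
    else:
        # Negative samples: soften the penalty on tokens inside steps judged correct
        # (e.g. self-reflection or error-correction), rather than penalizing all tokens equally.
        token_weight = 1.0 - nsa_weight * step_correct_mask
    # REINFORCE-style term with per-token weights.
    pg_term = -(reward * token_weight * logprobs).mean()
    # Simple proxy for a behavior constraint keeping the policy near the data-generating policy.
    behavior_constraint = kl_coef * (logprobs - ref_logprobs).pow(2).mean()
    return pg_term + behavior_constraint
```

In this sketch, setting `nsa_weight = 0` recovers uniform penalization of negative samples, while larger values increasingly shield the consensus-approved steps from the negative reward signal.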
Related papers
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Dissecting Long Reasoning Models: An Empirical Study [94.31064312707211]
We systematically analyze the roles of positive and negative samples in reinforcement learning (RL). We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z) - Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z) - Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials [9.668242919588199]
Hard negative samples can accelerate model convergence and optimize decision boundaries. This paper introduces the concept of Semantic Negative Sampling. We propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals.
arXiv Detail & Related papers (2025-04-07T04:39:45Z) - Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs [15.806503459642665]
We propose a new algorithm for fine-tuning large language models using reinforcement learning. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples.
arXiv Detail & Related papers (2025-03-18T14:23:37Z) - TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge [59.57934574562651]
TRACT (Two-stage Regression-Aware fine-tuning with CoT) is a method combining CoT reasoning with regression-aware training. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods.
arXiv Detail & Related papers (2025-03-06T12:33:20Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem.
It introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z) - PLReMix: Combating Noisy Labels with Pseudo-Label Relaxed Contrastive Representation Learning [7.556169113399857]
We propose an end-to-end PLReMix framework by introducing a Pseudo-Label Relaxed (PLR) contrastive loss. The proposed PLR loss is pluggable and we have integrated it into other LNL methods, observing their improved performance.
arXiv Detail & Related papers (2024-02-27T15:22:20Z)