Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
- URL: http://arxiv.org/abs/2505.14403v4
- Date: Mon, 15 Sep 2025 14:23:10 GMT
- Title: Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
- Authors: Zhaohui Yang, Yuxiao Ye, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang,
- Abstract summary: We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
- Score: 41.83677588934301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
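The third stage described above can be illustrated with a minimal sketch of per-token advantage shaping. The function and parameter names (`nsa_token_advantages`, `mining_coef`) are illustrative assumptions, not the paper's actual formulation; the sketch only shows the core idea of softening the penalty on steps judged correct inside negative samples.

```python
# Hypothetical sketch of NSA-style advantage shaping: a sequence-level
# reward is expanded into per-token advantages, and tokens belonging to
# steps the LLM/PRM judger consensus marked correct receive a reduced
# penalty when the overall sample is negative.

def nsa_token_advantages(reward, step_labels, tokens_per_step, mining_coef=0.5):
    """Expand a sequence-level reward into per-token advantages.

    reward: +1.0 for a correct final answer, -1.0 otherwise.
    step_labels: per-step consensus verdicts (True = step judged correct).
    tokens_per_step: token count of each segmented step.
    mining_coef: how strongly to soften the penalty on correct steps
        inside negative samples (0 = uniform penalty, 1 = no penalty).
    """
    advantages = []
    for correct, n_tok in zip(step_labels, tokens_per_step):
        a = reward
        if reward < 0 and correct:
            # "Mine the gem": attenuate the negative signal on this step.
            a = reward * (1.0 - mining_coef)
        advantages.extend([a] * n_tok)
    return advantages
```

With `mining_coef=0` this collapses to the uniform token-level penalty the abstract criticizes; larger values preserve more of the self-reflection and error-correction steps found in negative rollouts.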
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers [80.55429742713623]
ILRec is a novel preference fine-tuning framework for LLM-based recommender systems. We introduce a lightweight collaborative filtering model to assign token-level rewards for negative signals. Experiments on three datasets demonstrate ILRec's effectiveness in enhancing the performance of LLM-based recommender systems.
arXiv Detail & Related papers (2026-02-19T14:37:43Z) - GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning [14.111530312590531]
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI). We propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO). GDEPO incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases.
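The first mechanism above can be sketched in a few lines. The callables `sample_fn` and `verify` are stand-ins I introduce for illustration, not names from the paper; this is only an assumed reading of "resample invalid batches until a valid proof is discovered."

```python
# Illustrative sketch of dynamic additional sampling: keep drawing fresh
# batches of candidate proofs until at least one verifies, so the batch
# carries a usable learning signal.

def dynamic_additional_sampling(sample_fn, verify, max_rounds=8):
    """Resample until the batch contains at least one valid proof."""
    batch = []
    for _ in range(max_rounds):
        batch = sample_fn()
        if any(verify(candidate) for candidate in batch):
            return batch  # valid batch: usable for a policy update
    return batch  # budget exhausted; caller may skip this update
```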
arXiv Detail & Related papers (2026-01-11T07:34:41Z) - VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL [38.782188833641676]
Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models. They suffer from a critical gradient vanishing problem when all responses within a group receive identical rewards. We propose VADE, a variance-aware dynamic sampling framework via online sample-level difficulty estimation.
arXiv Detail & Related papers (2025-11-24T08:59:54Z) - ASPO: Asymmetric Importance Sampling Policy Optimization [31.38346888572171]
The Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. We propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens.
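The flipping strategy can be sketched as a one-branch weighting rule. The function name and exact form of the flip (inverting the ratio) are my assumptions about the abstract's description, not the paper's verified formulation.

```python
# Illustrative sketch of the ASPO idea: for positive-advantage tokens,
# flip the importance-sampling ratio so that tokens the current policy
# assigns LOW probability (small ratio) receive LARGER updates, instead
# of the standard behavior that over-amplifies high-probability tokens.

def asymmetric_is_ratio(pi_new, pi_old, advantage):
    """Return the weight applied to a token's policy-gradient term."""
    ratio = pi_new / pi_old
    if advantage > 0:
        return 1.0 / ratio  # flipped: small ratio -> large weight
    return ratio  # negative-advantage tokens keep the standard ratio
```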
arXiv Detail & Related papers (2025-10-07T15:54:24Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Dissecting Long Reasoning Models: An Empirical Study [94.31064312707211]
We systematically analyze the roles of positive and negative samples in reinforcement learning (RL). We identify substantial data inefficiency in group relative policy optimization, where over half of the samples yield zero advantage. We investigate unstable performance across various reasoning models and benchmarks, attributing instability to uncertain problems with ambiguous outcomes.
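The zero-advantage inefficiency noted above follows directly from how group-relative advantages are computed: when every response in a group receives the same reward, all advantages are zero and the prompt contributes no gradient. A minimal sketch (standard GRPO-style normalization; the function name is mine):

```python
# Group-relative advantages: reward minus the group mean, divided by
# the group standard deviation. If all rewards in the group are equal,
# every advantage is exactly zero and the sample is wasted.

def group_relative_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        # Degenerate group: no learning signal from this prompt.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```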
arXiv Detail & Related papers (2025-06-05T11:47:10Z) - Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z) - Can LLM-Driven Hard Negative Sampling Empower Collaborative Filtering? Findings and Potentials [9.668242919588199]
Hard negative samples can accelerate model convergence and optimize decision boundaries. This paper introduces the concept of Semantic Negative Sampling. We propose a framework called HNLMRec, based on fine-tuning LLMs supervised by collaborative signals.
arXiv Detail & Related papers (2025-04-07T04:39:45Z) - Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs [15.806503459642665]
We propose a new algorithm for fine-tuning large language models using reinforcement learning. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples.
arXiv Detail & Related papers (2025-03-18T14:23:37Z) - TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge [59.57934574562651]
TRACT (Two-stage Regression-Aware fine-tuning with CoT) is a method combining CoT reasoning with regression-aware training. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods.
arXiv Detail & Related papers (2025-03-06T12:33:20Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z) - Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem.
It introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z) - PLReMix: Combating Noisy Labels with Pseudo-Label Relaxed Contrastive Representation Learning [7.556169113399857]
We propose an end-to-end PLReMix framework by introducing a Pseudo-Label Relaxed (PLR) contrastive loss. The proposed PLR loss is pluggable, and we have integrated it into other LNL methods, observing their improved performance.
arXiv Detail & Related papers (2024-02-27T15:22:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.