Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
- URL: http://arxiv.org/abs/2512.21625v1
- Date: Thu, 25 Dec 2025 11:15:46 GMT
- Title: Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
- Authors: Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou
- Abstract summary: We investigate how sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization.
- Score: 57.11130904745293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
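The abstract describes A3PO only at a high level, so the exact shaping rule is not reproduced here. As a rough illustration of what polarity-asymmetric, token-level advantage shaping on top of a GRPO-style baseline could look like, the Python sketch below up-weights confident tokens in positive rollouts (sharpening) and uncertain tokens in negative rollouts (exploration). The weighting rule and the `alpha_pos`/`alpha_neg` knobs are illustrative assumptions, not the method from the paper.

```python
import numpy as np

def shaped_token_advantages(rewards, token_logprobs,
                            alpha_pos=0.5, alpha_neg=1.5):
    """Toy polarity-asymmetric token-level advantage shaping.

    rewards:        (G,) binary verifiable rewards for one prompt's G rollouts.
    token_logprobs: list of G arrays, per-token log-probs under the policy.
    The token weighting below is an illustrative guess, not A3PO itself.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Sample-level, group-relative (GRPO-style) advantage.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    shaped = []
    for a, lp in zip(adv, token_logprobs):
        p = np.exp(np.asarray(lp))          # per-token probability
        if a >= 0:   # positive rollout: sharpen already-confident tokens
            w = 1.0 + alpha_pos * p
        else:        # negative rollout: push on uncertain ("key") tokens
            w = 1.0 + alpha_neg * (1.0 - p)
        shaped.append(a * w)                # token-level shaped advantages
    return shaped

# Example: two correct and two incorrect rollouts of unequal length.
lps = [np.log([0.9, 0.6]), np.log([0.4]),
       np.log([0.8, 0.7, 0.5]), np.log([0.3, 0.2])]
print(shaped_token_advantages([1, 0, 1, 0], lps))
```

The asymmetry (`alpha_neg` larger than `alpha_pos`) mirrors the paper's observation that negative samples drive exploration of new reasoning paths, but how A3PO actually identifies "key tokens" is specified only in the paper itself.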
Related papers
- A Simple yet Effective Negative Sampling Plugin for Constructing Positive Sample Pairs in Implicit Collaborative Filtering [40.89512526196666]
PSP-NS is a negative sampling plugin for collaborative filtering. It builds a user-item bipartite graph with edge weights indicating interaction confidence. It generates positive sample pairs via replication-based reweighting to strengthen positive signals. PSP-NS boosts Recall@30 and Precision@30 by 32.11% and 22.90% on Yelp over the strongest baselines.
arXiv Detail & Related papers (2026-02-20T13:34:43Z) - Improving LLM-based Recommendation with Self-Hard Negatives from Intermediate Layers [80.55429742713623]
ILRec is a novel preference fine-tuning framework for LLM-based recommender systems. We introduce a lightweight collaborative filtering model to assign token-level rewards for negative signals. Experiments on three datasets demonstrate ILRec's effectiveness in enhancing the performance of LLM-based recommender systems.
arXiv Detail & Related papers (2026-02-19T14:37:43Z) - Causal Negative Sampling via Diffusion Model for Out-of-Distribution Recommendation [7.354459720418281]
Heuristic negative sampling enhances recommendation performance by selecting negative samples of varying hardness levels from predefined candidate pools. However, unobserved environmental confounders in candidate pools may cause sampling methods to introduce false hard negatives (FHNS). We propose a novel method named Causal Negative Sampling via Diffusion (CNSDiff) to address this issue.
arXiv Detail & Related papers (2025-08-10T08:55:21Z) - Dissecting Long-Chain-of-Thought Reasoning Models: An Empirical Study [91.78803511141975]
This work focuses on the roles of positive and negative samples in scaling reinforcement learning. We identify substantial data inefficiency in group relative policy optimization (GRPO), where over half of the samples yield zero advantage (a minimal numerical sketch of this effect appears after the list below). We investigate unstable performance across various reasoning models and benchmarks, attributing the instability to uncertain problems with ambiguous outcomes.
arXiv Detail & Related papers (2025-06-05T11:47:10Z) - Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case, as in the sketch below.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
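A minimal sketch of that advantage computation, assuming access to per-item Q-values (the names `q_values`, `pos_item`, and `neg_items` are hypothetical; the paper's exact estimator may differ):

```python
import numpy as np

def advantage_over_sampled_negatives(q_values, pos_item, neg_items):
    """Advantage of the observed positive item over the average
    Q-value of sampled negative items, per the summary above."""
    q_neg_mean = np.mean([q_values[i] for i in neg_items])
    return q_values[pos_item] - q_neg_mean

# Example with toy Q-values keyed by item id.
q = {0: 2.1, 1: 0.3, 2: -0.5, 3: 0.8}
print(advantage_over_sampled_negatives(q, pos_item=0, neg_items=[1, 2, 3]))  # 1.9
```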
arXiv Detail & Related papers (2021-11-05T12:51:15Z) - Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
arXiv Detail & Related papers (2021-05-27T08:38:29Z) - Understanding and Achieving Efficient Robustness with Adversarial Contrastive Learning [34.97017489872795]
The Adversarial Supervised Contrastive Learning (ASCL) approach outperforms the state-of-the-art defenses by 2.6% in terms of robust accuracy.
Our ASCL with the proposed selection strategy can further gain a 1.4% improvement with only 42.8% positives and 6.3% negatives compared with ASCL without a selection strategy.
arXiv Detail & Related papers (2021-01-25T11:57:52Z) - Understanding Negative Sampling in Graph Representation Learning [87.35038268508414]
We show that negative sampling is as important as positive sampling in determining the optimization objective and the resulting variance.
We propose MCNS, which approximates the positive distribution via self-contrast approximation and accelerates negative sampling with Metropolis-Hastings.
We evaluate our method on 5 datasets that cover extensive downstream graph learning tasks, including link prediction, node classification, and personalized recommendation.
arXiv Detail & Related papers (2020-05-20T06:25:21Z)
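As a concrete footnote to the "Dissecting Long-Chain-of-Thought" entry above: under GRPO with binary verifiable rewards, any group whose rollouts are all correct or all incorrect has zero reward variance, so every sample in it receives zero advantage and contributes no gradient. A minimal sketch of that effect using the standard GRPO normalization (not code from any of the papers listed):

```python
import numpy as np

def grpo_group_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's rollouts.
    All-correct or all-wrong groups have zero variance, so their
    advantages vanish -- the data inefficiency noted above."""
    r = np.asarray(rewards, dtype=float)
    if r.std() < eps:            # degenerate group: no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_group_advantages([1, 1, 1, 1]))  # [0. 0. 0. 0.] -- wasted rollouts
print(grpo_group_advantages([1, 0, 1, 0]))  # [ 1. -1.  1. -1.]
```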
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.