MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
- URL: http://arxiv.org/abs/2602.17550v2
- Date: Tue, 24 Feb 2026 08:43:15 GMT
- Title: MASPO: Unifying Gradient Utilization, Probability Mass, and Signal Reliability for Robust and Sample-Efficient LLM Reasoning
- Authors: Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Binbin Zheng, Chaowen Hu, Zekai Shao, Cong Qin, Lu Pan, Ke Zeng, Xunliang Cai
- Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) algorithms rely on rigid, uniform, and symmetric trust region mechanisms. We propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence.
- Score: 16.012761588513026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing Reinforcement Learning with Verifiable Rewards (RLVR) algorithms, such as GRPO, rely on rigid, uniform, and symmetric trust region mechanisms that are fundamentally misaligned with the complex optimization dynamics of Large Language Models (LLMs). In this paper, we identify three critical challenges in these methods: (1) inefficient gradient utilization caused by the binary cutoff of hard clipping, (2) insensitive probability mass arising from uniform ratio constraints that ignore the token distribution, and (3) asymmetric signal reliability stemming from the disparate credit assignment ambiguity between positive and negative samples. To bridge these gaps, we propose Mass-Adaptive Soft Policy Optimization (MASPO), a unified framework designed to harmonize these three dimensions. MASPO integrates a differentiable soft Gaussian gating to maximize gradient utility, a mass-adaptive limiter to balance exploration across the probability spectrum, and an asymmetric risk controller to align update magnitudes with signal confidence. Extensive evaluations demonstrate that MASPO serves as a robust, all-in-one RLVR solution, significantly outperforming baselines. Our code is at: https://github.com/VenomRose-Juri/MASPO-RL.
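The first challenge named in the abstract, the binary cutoff of hard clipping, can be illustrated with a small sketch. The hard-clip function below follows the standard PPO/GRPO clipped surrogate. The Gaussian gate is only an assumed illustrative form: its shape, the `sigma` parameter, and the function names are hypothetical and are not taken from the MASPO paper or its code.

```python
import math


def hard_clip_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative of min(r*A, clip(r, 1-eps, 1+eps)*A) with respect to r.

    Inside the trust region the weight is A; once clipping becomes active
    the weight drops to exactly zero, so the sample contributes no gradient.
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return advantage


def soft_gaussian_weight(ratio: float, advantage: float, sigma: float = 0.2) -> float:
    """Hypothetical soft gate (an assumption, not the paper's formula).

    The weight decays smoothly as the ratio moves away from 1 instead of
    being cut to zero at a fixed threshold.
    """
    gate = math.exp(-((ratio - 1.0) ** 2) / (2.0 * sigma ** 2))
    return advantage * gate


if __name__ == "__main__":
    # Compare both weightings for a positive-advantage token at several ratios.
    for r in (1.0, 1.1, 1.3, 1.6):
        print(r, hard_clip_weight(r, 1.0), round(soft_gaussian_weight(r, 1.0), 4))
```

With a positive advantage, the hard-clip weight is zero for any ratio above `1 + eps`, while the assumed Gaussian gate still passes a reduced, nonzero weight. The mass-adaptive limiter and asymmetric risk controller described in the abstract are not modeled here.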
Related papers
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning [17.384089089363382]
We identify a root cause that existing methods overlook: the uniform penalization of errors. Current approaches treat all incorrect rollouts within a group identically. We propose the Asymmetric Confidence-aware Error Penalty (ACE).
arXiv Detail & Related papers (2026-02-24T22:46:43Z) - Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs [55.77845440440496]
Push-based decentralized communication enables optimization over communication networks, where information exchange may be asymmetric. We develop a unified uniform-stability framework for the Gradient Push (SGP) algorithm. A key technical ingredient is an imbalance-aware generalization bound through two quantities.
arXiv Detail & Related papers (2026-02-24T05:32:03Z) - Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance. CalibRL increases policy entropy in a guided manner and clarifies the target distribution. Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z) - FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight [21.731032636844237]
This paper proposes a neuro-symbolic framework that employs a bidirectional Formal-of-Thought architecture. We validate across three benchmarks spanning behavioral safety, multi-domain constraint adherence, and agentic upward deception detection.
arXiv Detail & Related papers (2026-02-11T18:48:11Z) - Equivariant Evidential Deep Learning for Interatomic Potentials [55.6997213490859]
Uncertainty quantification is critical for assessing the reliability of machine learning interatomic potentials in molecular dynamics simulations. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. We propose Equivariant Evidential Deep Learning for Interatomic Potentials ($\text{e}^2$IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly.
arXiv Detail & Related papers (2026-02-11T02:00:25Z) - Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals [18.612081365101464]
We develop a framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025 and GSM8K further demonstrate more precise performance estimation and more reliable model rankings.
arXiv Detail & Related papers (2026-02-03T03:40:01Z) - Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs). We argue that Long CoT is inherently ill-suited for the sequential recommendation domain. We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z) - Generation Order and Parallel Decoding in Masked Diffusion Models: An Information-Theoretic Perspective [16.942478643768144]
Masked Diffusion Models (MDMs) significantly accelerate inference by trading off sequential determinism. We provide a unified information-theoretic framework to decouple and analyze two fundamental sources of failure: order sensitivity and parallelization bias.
arXiv Detail & Related papers (2026-01-30T20:15:18Z) - Majorization-Minimization Networks for Inverse Problems: An Application to EEG Imaging [4.063392865490957]
Inverse problems are often ill-posed and require optimization schemes with strong stability and convergence guarantees. We propose a learned Majorization-Minimization (MM) framework for inverse problems within a bilevel optimization setting. We learn a structured curvature majorant that governs each MM step while preserving classical MM descent guarantees.
arXiv Detail & Related papers (2026-01-23T10:33:45Z) - Feature-Space Adversarial Robustness Certification for Multimodal Large Language Models [59.6491828112519]
Multimodal large language models (MLLMs) exhibit strong capabilities across diverse applications. MLLMs are vulnerable to adversarial perturbations that distort their feature representations and induce erroneous predictions. We propose Feature-space Smoothing (FS), a general framework that provides certified robustness guarantees at the feature representation level of MLLMs.
arXiv Detail & Related papers (2026-01-22T18:52:21Z) - RMBRec: Robust Multi-Behavior Recommendation towards Target Behaviors [26.88506691092044]
We propose Robust Multi-Behavior Recommendation towards Target Behaviors (RMBRec). RMBRec is a robust multi-behavior recommendation framework grounded in an information-theoretic robustness principle. We show that RMBRec outperforms state-of-the-art methods in accuracy and maintains remarkable stability under various noise perturbations.
arXiv Detail & Related papers (2026-01-13T16:34:17Z) - Reinforcement Learning Using known Invariances [54.91261509214309]
This paper develops a theoretical framework for incorporating known group symmetries into kernel-based reinforcement learning. We show that symmetry-aware RL achieves significantly better performance than its standard kernel counterparts.
arXiv Detail & Related papers (2025-11-05T13:56:14Z) - CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z) - Robust Iterative Learning Hidden Quantum Markov Models [0.7493761475572844]
Hidden Quantum Markov Models (HQMMs) extend classical Hidden Markov Models to the quantum domain. Existing HQMM learning algorithms are sensitive to data corruption and lack mechanisms to ensure robustness under adversarial perturbations. We introduce the Adversarially Corrupted HQMM, which formalizes robustness analysis by allowing a controlled fraction of observation sequences to be adversarially corrupted.
arXiv Detail & Related papers (2025-10-27T11:48:44Z) - MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics [72.00014675808228]
Instability in the evaluation of Large Language Models obscures true learning dynamics. We introduce MaP, a framework that integrates Merging and the Pass@k metric. Experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent rankings.
arXiv Detail & Related papers (2025-10-10T11:40:27Z) - Stochastic Approximation Methods for Distortion Risk Measure Optimization [2.97238992700289]
This paper proposes descent algorithms for DRM optimization based on two dual representations. The DM-form employs a three-timescale algorithm to track quantiles, compute their gradients, and update decision variables. The QF-form provides a simpler two-timescale approach that avoids the need for complex quantile gradient estimation.
arXiv Detail & Related papers (2025-10-06T07:59:09Z) - MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources [113.33902847941941]
Variance-Aware Sampling (VAS) is a data selection strategy guided by Variance Promotion Score (VPS). We release large-scale, carefully curated resources containing 1.6M long CoT cold-start data and 15k RL QA pairs. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS.
arXiv Detail & Related papers (2025-09-25T14:58:29Z) - Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z) - Robust Quantum Control: Analysis & Synthesis via Averaging [0.2320417845168326]
An approach is presented for robustness analysis and quantum (unitary) control synthesis based on the classic method of averaging.
The result is a multicriterion optimization that trades off the nominal (uncertainty-free) fidelity against a well-known robustness measure: the size of an interaction (error) Hamiltonian.
arXiv Detail & Related papers (2022-08-30T12:09:40Z)