Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning
- URL: http://arxiv.org/abs/2601.03320v1
- Date: Tue, 06 Jan 2026 14:01:42 GMT
- Title: Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning
- Authors: Yu Luo, Shuo Han, Yihan Hu, Dong Li, Jianye Hao,
- Abstract summary: We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse.<n>$R2VPO$ achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
- Score: 48.34492357368989
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: On-policy reinforcement learning (RL), particularly Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), has become the dominant paradigm for fine-tuning large language models (LLMs). While policy ratio clipping stabilizes training, this heuristic hard constraint incurs a fundamental cost: it indiscriminately truncates gradients from high-return yet high-divergence actions, suppressing rare but highly informative "eureka moments" in complex reasoning. Moreover, once data becomes slightly stale, hard clipping renders it unusable, leading to severe sample inefficiency. In this work, we revisit the trust-region objective in policy optimization and show that explicitly constraining the \emph{variance (second central moment) of the policy ratio} provides a principled and smooth relaxation of hard clipping. This distributional constraint stabilizes policy updates while preserving gradient signals from valuable trajectories. Building on this insight, we propose $R^2VPO$ (Ratio-Variance Regularized Policy Optimization), a novel primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse by dynamically reweighting stale samples rather than discarding them. We extensively evaluate $R^2VPO$ on fine-tuning state-of-the-art LLMs, including DeepSeek-Distill-Qwen-1.5B and the openPangu-Embedded series (1B and 7B), across challenging mathematical reasoning benchmarks. Experimental results show that $R^2VPO$ consistently achieves superior asymptotic performance, with average relative gains of up to 17% over strong clipping-based baselines, while requiring approximately 50% fewer rollouts to reach convergence. These findings establish ratio-variance control as a promising direction for improving both stability and data efficiency in RL-based LLM alignment.
Related papers
- GIPO: Gaussian Importance Sampling Policy Optimization [12.306486689840774]
GIPO is proposed as a policy optimization objective based on truncated importance sampling.<n>It replaces hard clipping with a log-ratio-based Gaussian trust weight to damp extreme importance ratios.<n>GIPO achieves state-of-the-art performance among clipping-based baselines across a wide range of replay buffer sizes.
arXiv Detail & Related papers (2026-03-04T11:34:59Z) - Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs [19.079556051442168]
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks.<n>But for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly noisy.<n>We propose a stabilization method for REINFORCE/ GRPO-style algorithms that scales learning rate based on effective sample size to dampen unreliable updates.
arXiv Detail & Related papers (2026-02-19T18:40:51Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs)<n>We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint.<n>DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models.<n> token-level correction often leads to unstable training dynamics when the degree of off-policyness is large.<n>We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO)
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO)<n>BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization.<n>On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z) - Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning [49.57517969069136]
We introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings.<n>AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards.<n>It consistently improves learning stability and performance across multiple benchmarks over strong baselines.
arXiv Detail & Related papers (2025-10-02T04:24:27Z) - Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning [77.92320830700797]
Reinforcement Learning has played a central role in enabling reasoning capabilities of Large Language Models.<n>We propose a tractable computational framework that tracks and leverages curvature information during policy updates.<n>The algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out.
arXiv Detail & Related papers (2025-10-01T12:29:32Z) - Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization [11.739526562075339]
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function.<n>We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels.
arXiv Detail & Related papers (2024-11-27T18:57:03Z) - Bi-Level Offline Policy Optimization with Limited Exploration [1.8130068086063336]
We study offline reinforcement learning (RL) which seeks to learn a good policy based on a fixed, pre-collected dataset.
We propose a bi-level structured policy optimization algorithm that models a hierarchical interaction between the policy (upper-level) and the value function (lower-level)
We evaluate our model using a blend of synthetic, benchmark, and real-world datasets for offline RL, showing that it performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2023-10-10T02:45:50Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the emphstate value baseline allows on-policy.
emphnatural policy gradient (NPG) to converge to a globally optimal.
policy at an $O (1/t) rate gradient.
We find that the primary effect of the value baseline is to textbfreduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.