One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry
- URL: http://arxiv.org/abs/2601.22521v1
- Date: Fri, 30 Jan 2026 03:58:54 GMT
- Title: One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry
- Authors: Weisong Zhao, Tong Wang, Zichang Tan, Te Yang, Siran Peng, Haoyuan Zhang, Tianshuo Zhang, Haichao Shi, Meng Meng, Yang Yang, Xiangyu Zhu, Zhen Lei, Xiao-Yu Zhang, Xu Zhou
- Abstract summary: Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. We unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry. We show that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution.
- Score: 40.539393367855176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability through a more conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean geometry exponent p. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting p modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine p adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory's clipping fraction to a target ESS, and then solve for the p that aligns the trajectory-induced ESS with this target. This allows PMPO to transition dynamically between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
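To make the aggregation idea concrete, here is a minimal, hypothetical Python sketch of power-mean aggregation over per-token importance weights, together with a simple grid search for the exponent p whose induced effective sample size matches a target. The function names, the clip-fraction-to-target-ESS mapping, and the example weights are illustrative assumptions based only on the abstract, not the paper's actual objective or implementation.

```python
import numpy as np

def power_mean(weights, p, eps=1e-8):
    # Power mean of positive per-token importance weights.
    # p = 1 recovers the arithmetic mean (GRPO-style aggregation);
    # p -> 0 recovers the geometric mean (GMPO-style aggregation).
    w = np.asarray(weights, dtype=np.float64)
    if abs(p) < eps:  # limiting case p -> 0: geometric mean
        return float(np.exp(np.mean(np.log(w))))
    return float(np.mean(w ** p) ** (1.0 / p))

def effective_sample_size(weights, p):
    # Normalized ESS of the p-powered weights, in (1/n, 1].
    w = np.asarray(weights, dtype=np.float64) ** p
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

def solve_p_for_target_ess(weights, target_ess):
    # Grid search for the p whose induced ESS is closest to the target
    # (the paper likely uses a more principled solver; this is a stand-in).
    grid = np.linspace(0.01, 1.0, 100)
    errors = [abs(effective_sample_size(weights, p) - target_ess) for p in grid]
    return float(grid[int(np.argmin(errors))])

# Hypothetical per-token importance weights for one trajectory, with one outlier.
w = [0.9, 1.1, 1.0, 3.5, 0.8]
clip_fraction = 0.2                         # assumed fraction of clipped tokens
target_ess = 0.7 + 0.3 * clip_fraction      # assumed rule: more clipping -> higher target ESS -> smaller p
p_star = solve_p_for_target_ess(w, target_ess)
print(power_mean(w, 1.0))                   # arithmetic mean, pulled up by the outlier
print(power_mean(w, 0.0))                   # geometric mean, outlier suppressed
print(p_star, power_mean(w, p_star))        # adaptive exponent lands in between
```

In this toy example the outlier weight 3.5 drags the arithmetic mean to roughly 1.46 while the geometric mean stays near 1.23; the solved exponent produces an aggregate in between, illustrating the dynamic transition between the two regimes that the abstract describes.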
Related papers
- Stability and Generalization of Push-Sum Based Decentralized Optimization over Directed Graphs [55.77845440440496]
Push-based decentralized communication enables optimization over communication networks, where information exchange may be asymmetric. We develop a unified uniform-stability framework for the Stochastic Gradient Push (SGP) algorithm. A key technical ingredient is an imbalance-aware generalization bound built from two quantities.
arXiv Detail & Related papers (2026-02-24T05:32:03Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - Spherical Steering: Geometry-Aware Activation Rotation for Language Models [15.078810641141295]
Inference-time steering has emerged as a promising paradigm for controlling language models (LMs) without the cost of retraining. In this work, we explore Spherical Steering, a training-free primitive that resolves this trade-off through activation rotation. Our method rotates activations along a geodesic toward a target direction, guiding the activation toward the target concept while preserving the integrity of the signal.
arXiv Detail & Related papers (2026-02-09T00:15:47Z) - Orthogonalized Policy Optimization:Decoupling Sampling Geometry from Optimization Geometry in RLHF [0.0]
Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants. In this work, we argue that this diversity obscures a simpler underlying structure. We show that this entanglement is not merely a modeling convenience but a source of systematic instability.
arXiv Detail & Related papers (2026-01-18T13:57:44Z) - Geometric-Mean Policy Optimization [117.05113769757172]
Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models. GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards. We propose Geometric-Mean Policy Optimization (GMPO) to improve the stability of GRPO by suppressing token reward outliers.
arXiv Detail & Related papers (2025-07-28T09:54:05Z) - CP$^2$: Leveraging Geometry for Conformal Prediction via Canonicalization [51.716834831684004]
We study the problem of conformal prediction (CP) under geometric data shifts. We propose integrating geometric information, such as geometric pose, into the conformal procedure to reinstate its guarantees.
arXiv Detail & Related papers (2025-06-19T10:12:02Z) - GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization [63.107398132743825]
Group Contrastive Policy Optimization (GCPO) is a novel reinforcement learning framework featuring two key innovations. We develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction.
arXiv Detail & Related papers (2025-06-08T14:18:15Z) - Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning [12.987019067098412]
We adapt Nesterov's celebrated accelerated gradient (NAG) method to policy optimization in Reinforcement Learning (RL).
We prove that the resulting Accelerated Policy Gradient (APG) method converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially growing step sizes.
arXiv Detail & Related papers (2023-10-18T11:33:22Z)
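As a companion to the Accelerated Policy Gradient entry above, the following is a minimal, self-contained sketch of Nesterov-style momentum applied to a softmax policy on a toy three-armed bandit; the reward vector, step size, and momentum schedule are illustrative assumptions and are not taken from that paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy three-armed bandit: maximize J(theta) = E_{a ~ pi_theta}[r_a].
r = np.array([1.0, 0.5, 0.2])   # assumed per-arm rewards
theta = np.zeros(3)             # softmax policy logits
prev = theta.copy()
eta = 0.5                       # assumed constant step size

for t in range(1, 201):
    # Nesterov look-ahead point; (t - 1) / (t + 2) is a standard momentum schedule.
    y = theta + (t - 1) / (t + 2) * (theta - prev)
    pi = softmax(y)
    # Exact gradient of J w.r.t. the softmax logits, evaluated at the look-ahead point.
    grad = pi * (r - pi @ r)
    prev = theta
    theta = y + eta * grad      # gradient ascent step (we maximize expected reward)

print(softmax(theta))           # probability mass concentrates on the best arm (index 0)
```

The look-ahead point extrapolates along the previous update direction before the gradient is evaluated, which is the momentum mechanism behind the accelerated rates cited above.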