Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
- URL: http://arxiv.org/abs/2601.12415v2
- Date: Wed, 21 Jan 2026 14:54:54 GMT
- Title: Orthogonalized Policy Optimization: Decoupling Sampling Geometry from Optimization Geometry in RLHF
- Authors: Wang Zixian
- Abstract summary: Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants. In this work, we argue that this diversity obscures a simpler underlying structure. We show that this entanglement is not merely a modeling convenience but a source of systematic instability.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model alignment objectives are often presented as a collection of distinct algorithms, such as PPO, DPO, IPO, and their variants, each motivated by different derivations. In this work, we argue that this diversity obscures a simpler underlying structure. At a fundamental level, alignment objectives involve two independent design choices: (i) how training signals are sampled and weighted, and (ii) how deviations from a reference policy are geometrically penalized. Existing methods typically entangle these choices through a single divergence, most commonly the Kullback-Leibler divergence. We show that this entanglement is not merely a modeling convenience but a source of systematic instability. When the same divergence simultaneously determines sample weighting and optimization curvature, adjusting one aspect, such as exploration strength, inevitably alters the other, such as gradient geometry. This coupling is particularly problematic in preference-based reinforcement learning, where advantage signals are unbounded and high-confidence regimes are common. We propose a simple but structural remedy by formulating alignment as an orthogonal mirror descent problem, in which sampling geometry enters only as a linear driving force, while optimization geometry is determined independently by a mirror map. This perspective leads to a new alignment objective called Orthogonalized Policy Optimization (OPO), obtained by choosing a Euclidean mirror map in likelihood ratio space. The resulting objective admits a closed-form solution, linear and non-saturating gradient dynamics, and a well-conditioned trust region, while remaining fully compatible with standard large language model training pipelines.
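The structure described in the abstract can be made concrete. Below is a minimal sketch, not the authors' implementation, of what an OPO-style per-sample objective could look like under the stated design: the advantage enters only as a linear driving force on the likelihood ratio, while a Euclidean penalty in ratio space supplies the optimization geometry, yielding a linear, non-saturating gradient and the closed-form optimum r* = 1 + A/beta. The names (opo_loss, beta) are illustrative assumptions.

```python
import torch

def opo_loss(logp, logp_ref, advantage, beta=1.0):
    """Hypothetical OPO-style objective: a linear driving force from the
    advantage plus a Euclidean penalty on the likelihood ratio.

    With r = pi(a|s) / pi_ref(a|s), the per-sample loss
        L = -A * r + (beta / 2) * (r - 1)**2
    has gradient dL/dr = -A + beta * (r - 1), which is affine in r
    (non-saturating) and vanishes at the closed-form optimum
        r* = 1 + A / beta.
    """
    ratio = torch.exp(logp - logp_ref)          # likelihood ratio r
    driving = -advantage * ratio                # sampling signal, linear in r
    penalty = 0.5 * beta * (ratio - 1.0) ** 2   # Euclidean mirror-map term
    return (driving + penalty).mean()

# Toy usage with random log-probabilities and advantages.
logp = torch.randn(8, requires_grad=True)
logp_ref = logp.detach() + 0.1 * torch.randn(8)
adv = torch.randn(8)
loss = opo_loss(logp, logp_ref, adv, beta=2.0)
loss.backward()
print(loss.item(), logp.grad)
```

Because the gradient in r is affine, a large advantage shifts the optimum linearly instead of saturating the update, which is consistent with the well-conditioned trust region claimed in the abstract.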
Related papers
- Why Deep Jacobian Spectra Separate: Depth-Induced Scaling and Singular-Vector Alignment [10.515277266852838]
We show that depth-induced exponential scaling of ordered singular values and strong spectral separation can be used to study deep Jacobians. We further show that sufficiently strong separation forces singular-vector alignment in matrix products, yielding an approximately shared singular basis for intermediate Jacobians. Experiments in fixed-gates settings validate the predicted scaling, alignment, and resulting dynamics.
arXiv Detail & Related papers (2026-02-12T20:27:59Z)
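The depth-induced scaling described in the summary above is easy to reproduce in miniature: the ordered singular values of a product of random layer Jacobians separate at an exponential rate in depth. A small sketch with Gaussian stand-ins for real layer Jacobians (an illustration, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 32, 12

# Product of per-layer Jacobian stand-ins (iid Gaussian, variance 1/n).
J = np.eye(n)
for _ in range(depth):
    J = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n)) @ J

s = np.linalg.svd(J, compute_uv=False)
# Lyapunov-style exponents: log singular values normalized by depth.
# Strong separation of the leading exponents is what forces the
# approximately shared singular basis for intermediate products.
print("top log-singular-values / depth:", np.log(s[:5]) / depth)
```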
ARGUS: Adaptive Rotation-Invariant Geometric Unsupervised System [0.0]
This paper introduces Argus, a framework that reconceptualizes drift detection as tracking local statistics over a fixed spatial partition of the data manifold. Voronoi tessellations over canonical orthonormal frames yield drift metrics that are invariant to transformations. A graph-theoretic characterization of drift propagation is developed that distinguishes coherent distributional shifts from isolated perturbations.
arXiv Detail & Related papers (2026-01-03T22:39:20Z)
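The fixed-partition idea in the ARGUS summary above can be sketched directly: fix anchor points once, assign each data window to the induced Voronoi cells, and compare per-cell occupancy across windows. This is a minimal illustration, not the Argus system itself; cell_histogram and the total-variation score are assumed names.

```python
import numpy as np

def cell_histogram(X, anchors):
    """Occupancy of each Voronoi cell induced by fixed anchor points."""
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=-1)
    cells = d.argmin(axis=1)
    return np.bincount(cells, minlength=len(anchors)) / len(X)

rng = np.random.default_rng(1)
anchors = rng.normal(size=(16, 2))           # fixed partition, chosen once
ref = rng.normal(size=(2000, 2))             # reference window
new = rng.normal(loc=0.5, size=(2000, 2))    # shifted window

h_ref, h_new = cell_histogram(ref, anchors), cell_histogram(new, anchors)
drift = 0.5 * np.abs(h_ref - h_new).sum()    # total-variation drift score
print("per-cell TV drift:", round(drift, 3))
```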
Parallel Diffusion Solver via Residual Dirichlet Policy Optimization [88.7827307535107]
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-dimensional budget. We propose the Ensemble Parallel Direction solver (dubbed EPD-EPr), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step.
arXiv Detail & Related papers (2025-12-28T05:48:55Z) - Learning Geometry: A Framework for Building Adaptive Manifold Models through Metric Optimization [8.201374511929538]
This paper proposes a novel paradigm for machine learning that moves beyond traditional parameter optimization. We optimize the metric tensor field on a manifold with a predefined topology, thereby dynamically shaping the geometric structure of the model space. This work lays a solid foundation for constructing fully dynamic "meta-learners" capable of autonomously evolving their geometry and topology.
arXiv Detail & Related papers (2025-10-30T01:53:32Z) - Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods [50.070182958880146]
We propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting. We introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning.
arXiv Detail & Related papers (2025-10-12T19:39:41Z)
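For intuition on the preconditioned norms in the summary above: steepest descent with respect to the norm ||d||_P = sqrt(d^T P d) moves along -P^{-1} g. The sketch below instantiates P as a diagonal, Adam-style preconditioner on a badly scaled quadratic; it is a generic illustration of the idea, not the paper's MuAdam method.

```python
import numpy as np

def preconditioned_step(x, grad, v, lr=0.1, beta2=0.999, eps=1e-8):
    """One steepest-descent step under the norm ||d||_P = sqrt(d^T P d),
    with a diagonal Adam-style preconditioner P = diag(sqrt(v) + eps).
    The steepest-descent direction for this norm is -P^{-1} grad."""
    v = beta2 * v + (1 - beta2) * grad**2    # second-moment estimate
    P_inv = 1.0 / (np.sqrt(v) + eps)         # inverse preconditioner
    return x - lr * P_inv * grad, v

# Toy quadratic f(x) = 0.5 * x^T diag(c) x with badly scaled curvature.
c = np.array([1.0, 100.0])
x = np.array([1.0, 1.0])
v = (c * x) ** 2   # warm-start the preconditioner with the first gradient
for _ in range(200):
    x, v = preconditioned_step(x, c * x, v)
print("final iterate:", x)
```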
Non-Euclidean Broximal Point Method: A Blueprint for Geometry-Aware Optimization [55.002497070656624]
Broximal Point Method (BPM) offers an idealized optimization framework based on iteratively minimizing the objective function over norm balls centered at the current iterate. It enjoys striking global convergence guarantees, converging linearly and in a finite number of steps for proper, closed and convex functions. In this note, we ask whether the convergence theory of BPM can be extended to this more general, non-Euclidean setting.
arXiv Detail & Related papers (2025-10-01T12:32:52Z) - Neural Optimal Transport Meets Multivariate Conformal Prediction [58.43397908730771]
We propose a framework for conditional vector quantile regression (CVQR). CVQR combines neural optimal transport with quantized optimization, and we apply it to multivariate conformal prediction.
arXiv Detail & Related papers (2025-09-29T19:50:19Z) - Enforcing Latent Euclidean Geometry in Single-Cell VAEs for Manifold Interpolation [79.27003481818413]
We introduce FlatVI, a training framework that regularises the latent manifold of discrete-likelihood variational autoencoders towards Euclidean geometry. By encouraging straight lines in the latent space to approximate geodesics on the decoded single-cell manifold, FlatVI enhances compatibility with downstream approaches.
arXiv Detail & Related papers (2025-07-15T23:08:14Z) - Generalized Linear Mode Connectivity for Transformers [87.32299363530996]
A striking phenomenon is linear mode connectivity (LMC), where independently trained models can be connected by low- or zero-loss paths. Prior work has predominantly focused on neuron re-ordering through permutations, but such approaches are limited in scope. We introduce a unified framework that captures four symmetry classes: permutations, semi-permutations, transformations, and general invertible maps. This generalization enables, for the first time, the discovery of low- and zero-barrier linear paths between independently trained Vision Transformers and GPT-2 models.
arXiv Detail & Related papers (2025-06-28T01:46:36Z)
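Linear mode connectivity from the summary above is cheap to probe once two parameter vectors are expressed in a common basis: evaluate the loss along the straight segment between them and report the barrier over the endpoint average. A toy sketch (the paper's contribution is the symmetry alignment itself, which is omitted here):

```python
import numpy as np

def linear_path_barrier(loss_fn, theta_a, theta_b, num=11):
    """Loss barrier along the segment (1 - t) * theta_a + t * theta_b.
    Barrier = max interpolated loss minus the average endpoint loss."""
    ts = np.linspace(0.0, 1.0, num)
    losses = np.array([loss_fn((1 - t) * theta_a + t * theta_b) for t in ts])
    return losses.max() - 0.5 * (losses[0] + losses[-1]), losses

# Toy non-convex loss standing in for two trained networks' weights.
loss = lambda th: float(np.sum(np.sin(3 * th) ** 2 + 0.1 * th ** 2))
rng = np.random.default_rng(2)
a, b = rng.normal(size=5), rng.normal(size=5)
barrier, _ = linear_path_barrier(loss, a, b)
print("barrier along linear path:", round(barrier, 3))
```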
Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness [51.302674884611335]
This work introduces a hybrid non-Euclidean optimization method which generalizes norm clipping by combining steepest descent and conditional gradient approaches. We discuss how to instantiate the algorithms for deep learning and demonstrate their properties on image classification and language modeling.
arXiv Detail & Related papers (2025-06-02T17:34:29Z)
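Norm clipping from the summary above already has the hybrid two-regime form: plain steepest-descent steps while the gradient norm is below a threshold, and fixed-length normalized steps otherwise, as in a conditional-gradient move on a norm ball. A minimal Euclidean sketch (the paper treats general non-Euclidean norms):

```python
import numpy as np

def clipped_step(x, grad, lr=0.1, clip=1.0):
    """Gradient step with norm clipping: steepest descent when
    ||grad|| <= clip, and a fixed-length normalized step otherwise."""
    g_norm = np.linalg.norm(grad)
    scale = min(1.0, clip / (g_norm + 1e-12))
    return x - lr * scale * grad

x = np.array([10.0, -10.0])
for _ in range(300):
    x = clipped_step(x, 2.0 * x, lr=0.1, clip=1.0)   # f(x) = ||x||^2
print("final iterate norm:", np.linalg.norm(x))
```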
Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle. Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error. We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
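The smoothed (perturbed) policy in the summary above can be estimated by Monte Carlo: average the one-hot argmax of the perturbed score over perturbation draws. A minimal sketch over one-hot decisions (illustrative; the paper's setting covers general combinatorial polytopes):

```python
import numpy as np

def smoothed_linear_oracle(theta, sigma=0.5, n_samples=1000, rng=None):
    """Monte Carlo estimate of the smoothed policy
    E_Z[argmax_y <theta + sigma * Z, y>] over one-hot decisions y."""
    rng = rng or np.random.default_rng(0)
    Z = rng.normal(size=(n_samples, theta.size))
    winners = (theta + sigma * Z).argmax(axis=1)   # perturbed linear oracle
    return np.bincount(winners, minlength=theta.size) / n_samples

theta = np.array([1.0, 0.9, -0.5])
print("hard argmax:", theta.argmax())
print("smoothed policy:", smoothed_linear_oracle(theta))
```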
Differentially Private Optimization with Sparse Gradients [60.853074897282625]
We study differentially private (DP) optimization problems under sparsity of individual gradients.
Building on this, we obtain pure- and approximate-DP algorithms with almost optimal rates for convex optimization with sparse gradients.
arXiv Detail & Related papers (2024-04-16T20:01:10Z) - Geometry-Aware Normalizing Wasserstein Flows for Optimal Causal Inference [0.0]
This paper presents a groundbreaking approach to causal inference by integrating continuous normalizing flows with parametric submodels.
We leverage optimal transport and Wasserstein gradient flows to develop causal inference methodologies with minimal variance in finite-sample settings.
Preliminary experiments showcase our method's superiority, yielding lower mean-squared errors compared to standard flows.
arXiv Detail & Related papers (2023-11-30T18:59:05Z) - Adaptive Zeroth-Order Optimisation of Nonconvex Composite Objectives [1.7640556247739623]
We analyze algorithms for zeroth-order optimization of nonconvex composite objectives, focusing on the dependence on dimensionality.
This is achieved by exploiting the low-dimensional structure of the decision set using the mirror descent method with an entropy-like distance-generating function.
To improve the gradient estimate, we replace the classic sampling method with one based on the Rademacher distribution and show that the mini-batch method copes with non-Euclidean geometry.
arXiv Detail & Related papers (2022-08-09T07:36:25Z)
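The Rademacher-based estimator mentioned in the summary above is, in its generic two-point mini-batch form, only a few lines; the sketch below is a standard construction and not necessarily the paper's exact estimator.

```python
import numpy as np

def zo_gradient(f, x, delta=1e-3, batch=32, rng=None):
    """Mini-batch two-point zeroth-order gradient estimator with
    Rademacher (+/-1) perturbation directions."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(batch):
        u = rng.choice([-1.0, 1.0], size=x.shape)   # Rademacher direction
        g += (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u
    return g / batch

f = lambda x: float(np.sum(x ** 2))                 # smooth test objective
x = np.ones(10)
print("estimate:", zo_gradient(f, x)[:3], "true:", (2 * x)[:3])
```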
Parametric Generative Schemes with Geometric Constraints for Encoding and Synthesizing Airfoils [25.546237636065182]
Two deep learning-based generative schemes are proposed to capture the complexity of the design space while satisfying specific constraints.
The soft-constrained scheme generates airfoils with slight deviations from the expected geometric constraints, while still converging to the reference airfoil.
The hard-constrained scheme produces airfoils with a wider range of geometric diversity while strictly adhering to the geometric constraints.
arXiv Detail & Related papers (2022-05-05T05:58:08Z) - GELATO: Geometrically Enriched Latent Model for Offline Reinforcement Learning [54.291331971813364]
Offline reinforcement learning approaches can be divided into proximal and uncertainty-aware methods.
In this work, we demonstrate the benefit of combining the two in a latent variational model.
Our proposed metrics measure both the quality of out-of-distribution samples and the discrepancy of examples in the data.
arXiv Detail & Related papers (2021-02-22T19:42:40Z) - On the Convergence Rate of Projected Gradient Descent for a Back-Projection based Objective [58.33065918353532]
We consider a back-projection (BP) based fidelity term as an alternative to the common least squares (LS) term.
We show that using the BP term, rather than the LS term, requires fewer iterations of optimization algorithms.
arXiv Detail & Related papers (2020-05-03T00:58:23Z)
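The iteration-count claim in the summary above has a clean numerical illustration: with A^+ the Moore-Penrose pseudoinverse, the BP fidelity 0.5*||A^+(Ax - y)||^2 has a Hessian that acts as an orthogonal projector (eigenvalues 0 or 1), so gradient descent on it is perfectly conditioned, whereas the LS term inherits the conditioning of A. A small sketch with a random underdetermined A (illustrative setup only):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 50                                   # underdetermined system
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
y = A @ x_true
A_pinv = np.linalg.pinv(A)

# Gradients of the LS term 0.5*||Ax - y||^2 and the BP term
# 0.5*||A^+(Ax - y)||^2 (the latter simplifies to A^+(Ax - y)).
ls_grad = lambda x: A.T @ (A @ x - y)
bp_grad = lambda x: (A_pinv @ A).T @ (A_pinv @ (A @ x - y))

def run(grad, lr, steps=10):
    x = np.zeros(n)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

for name, g, lr in [("LS", ls_grad, 1e-2), ("BP", bp_grad, 1.0)]:
    x = run(g, lr)
    print(name, "residual after 10 steps:",
          round(float(np.linalg.norm(A @ x - y)), 6))
```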
Geometry, Computation, and Optimality in Stochastic Optimization [24.154336772159745]
We study computational and statistical consequences of problem geometry in stochastic and online optimization. By focusing on constraint set and gradient geometry, we characterize the problem families for which stochastic- and adaptive-gradient methods are (minimax) optimal.
arXiv Detail & Related papers (2019-09-23T16:14:26Z)