Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
- URL: http://arxiv.org/abs/2602.21269v1
- Date: Tue, 24 Feb 2026 12:59:32 GMT
- Title: Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
- Authors: Wang Zixian
- Abstract summary: We present a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Group Orthogonalized Policy Optimization (GOPO), a new alignment algorithm for large language models derived from the geometry of Hilbert function spaces. Instead of optimizing on the probability simplex and inheriting the exponential curvature of Kullback-Leibler divergence, GOPO lifts alignment into the Hilbert space L2(pi_k) of square-integrable functions with respect to the reference policy. Within this space, the simplex constraint reduces to a linear orthogonality condition <v, 1> = 0, defining a codimension-one subspace H0. Minimizing distance to an unconstrained target u_star yields the work-dissipation functional J(v) = <g, v> - (mu / 2) ||v||^2, whose maximizer follows directly from the Hilbert projection theorem. Enforcing the boundary v >= -1 produces a bounded Hilbert projection that induces exact sparsity, assigning zero probability to catastrophically poor actions through a closed-form threshold. To connect this functional theory with practice, GOPO projects from infinite-dimensional L2(pi_k) to a finite empirical subspace induced by group sampling. Because group-normalized advantages sum to zero, the Lagrange multiplier enforcing probability conservation vanishes exactly, reducing the constrained projection to an unconstrained empirical loss. The resulting objective has constant Hessian curvature mu I, non-saturating linear gradients, and an intrinsic dead-zone mechanism without heuristic clipping. Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
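Read literally, the abstract pins down the shape of the empirical objective (quadratic in the lifted variable, mean-zero advantages, a clamp at v = -1) but not the exact parameterization. The PyTorch sketch below is a hedged reconstruction, not the paper's code: it assumes the lift is v = pi_theta / pi_ref - 1 (so the boundary v >= -1 encodes nonnegative probabilities), assumes group normalization means mean-centering the rewards, and the names `gopo_loss` and the default `mu` are invented for illustration.

```python
import torch

def gopo_loss(logp_theta, logp_ref, rewards, mu=1.0):
    """Sketch of a GOPO-style group objective (assumptions noted above)."""
    # Group-normalized advantages: mean-zero by construction, so the
    # probability-conservation multiplier for <v, 1> = 0 drops out,
    # as the abstract claims.
    adv = rewards - rewards.mean()
    # Assumed lift: v = pi_theta / pi_ref - 1, so the simplex boundary
    # pi_theta >= 0 becomes v >= -1 automatically.
    v = torch.exp(logp_theta - logp_ref) - 1.0
    # Closed-form bounded projection target v* = clamp(adv / mu, -1): any
    # action with adv <= -mu gets exactly zero probability (the abstract's
    # sparsity threshold). Shown for illustration, not used below.
    v_star = torch.clamp(adv / mu, min=-1.0)
    # Empirical work-dissipation functional J(v) = <g, v> - (mu/2) ||v||^2;
    # its gradient in v is adv - mu*v (linear, non-saturating) and its
    # curvature in v is the constant mu.
    j = adv * v - 0.5 * mu * v.pow(2)
    return -j.mean()  # maximize J by minimizing -J

# Toy usage with one sampled group of G = 4 completions:
G = 4
logp_theta = torch.randn(G, requires_grad=True)
loss = gopo_loss(logp_theta, torch.randn(G), torch.randn(G), mu=0.5)
loss.backward()
```

Under these assumptions the contrast with clipping-based surrogates is visible directly: the per-sample gradient in v is affine, so it never saturates, and the clamp threshold adv <= -mu is the intrinsic dead zone the abstract describes.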
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z)
- Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update (OMWU) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
arXiv Detail & Related papers (2025-12-31T12:08:29Z)
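For concreteness, here is a minimal sketch of the OMWU update the snippet names, in a plain zero-sum matrix game rather than the paper's preference-feedback (NLHF) setting; the step size `eta`, the horizon `T`, and the payoff matrix are illustrative choices.

```python
import numpy as np

def omwu(A, eta=0.05, T=2000):
    """Optimistic multiplicative weights on min_x max_y x^T A y."""
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    gx_prev, gy_prev = np.zeros(m), np.zeros(n)
    for _ in range(T):
        gx, gy = A @ y, A.T @ x              # x's losses, y's gains
        # Optimistic step: use the extrapolated gradient 2*g_t - g_{t-1}.
        x = x * np.exp(-eta * (2 * gx - gx_prev))
        y = y * np.exp(+eta * (2 * gy - gy_prev))
        x, y = x / x.sum(), y / y.sum()      # renormalize to the simplex
        gx_prev, gy_prev = gx, gy
    return x, y

# Rock-paper-scissors: iterates approach the uniform equilibrium.
x_star, y_star = omwu(np.array([[0., 1., -1.], [-1., 0., 1.], [1., -1., 0.]]))
```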
- Random Gradient-Free Optimization in Infinite Dimensional Spaces [3.8031924942083517]
We propose a random gradient-free method for optimization in infinite-dimensional Hilbert spaces. Our framework requires only the computation of directional derivatives and a pre-basis for the Hilbert space domain. We showcase the use of our method to solve partial differential equations à la physics-informed neural networks.
arXiv Detail & Related papers (2025-12-23T18:09:49Z)
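A sketch of the mechanism the snippet describes, with the caveat that the domain here is a finite coefficient vector over a truncated basis, a stand-in for the paper's pre-basis of an infinite-dimensional space; all names and parameters are illustrative.

```python
import numpy as np

def random_directional_descent(dirderiv, x0, eta=0.1, steps=1000, seed=0):
    """Gradient-free descent using only directional derivatives:
    sample a random unit direction u and step along -f'(x; u) * u."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        u = rng.standard_normal(x.size)
        u /= np.linalg.norm(u)
        # For smooth f, E[f'(x; u) u] = grad f(x) / dim, so this step is an
        # unbiased (scaled) gradient estimate built from directional
        # derivatives alone.
        x = x - eta * dirderiv(x, u) * u
    return x

# f(x) = 0.5 ||x||^2 has directional derivative f'(x; u) = <x, u>.
x_final = random_directional_descent(lambda x, u: x @ u, np.ones(5))
```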
- Non-Euclidean Broximal Point Method: A Blueprint for Geometry-Aware Optimization [55.002497070656624]
The Broximal Point Method (BPM) offers an idealized optimization framework based on iteratively minimizing the objective function over norm balls centered at the current iterate. It enjoys striking global convergence guarantees, converging linearly and in a finite number of steps for proper, closed and convex functions. In this note, we ask whether the convergence theory of BPM can be extended to a more general, non-Euclidean setting.
arXiv Detail & Related papers (2025-10-01T12:32:52Z)
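The snippet's description fixes the BPM iteration up to notation; writing it out (the ball radius r and the choice of norm are my notation, not the note's):

```latex
% Broximal point step: minimize f over the norm ball of radius r
% centered at the current iterate x_k.
x_{k+1} \in \operatorname*{arg\,min}_{\|x - x_k\| \le r} f(x)
```

The cited guarantee (linear and finite-step convergence for proper, closed, convex f) is what the note then asks to extend beyond Euclidean norms.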
- Graph-based Clustering Revisited: A Relaxation of Kernel $k$-Means Perspective [73.18641268511318]
We propose a graph-based clustering algorithm that relaxes only the orthonormal constraint to derive clustering results. To ensure the doubly stochastic constraint within a gradient-based scheme, we transform the non-negative constraint into a class probability parameter.
arXiv Detail & Related papers (2025-09-23T09:14:39Z)
- Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z)
- Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm [30.08981916090924]
We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between $L_2$ and the hypothesis space.
We show that these rates are optimal in most cases and independent of the dimension of the output space.
arXiv Detail & Related papers (2023-12-12T11:48:56Z)
- Revisiting Subgradient Method: Complexity and Convergence Beyond Lipschitz Continuity [24.45688490844496]
The subgradient method is one of the most fundamental algorithmic schemes for nonsmooth optimization.
In this work, we first extend the typical iteration complexity results for the subgradient method to cover non-Lipschitz convex and weakly convex minimization.
arXiv Detail & Related papers (2023-05-23T15:26:36Z)
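As a reference point for the extension the snippet describes, here is the textbook subgradient iteration with diminishing steps, the baseline whose complexity the paper pushes beyond Lipschitz continuity; the names and step schedule are illustrative.

```python
import numpy as np

def subgradient_method(subgrad, x0, steps=500, alpha0=1.0):
    """x_{k+1} = x_k - a_k * g_k, with g_k in the subdifferential of f at
    x_k and diminishing steps a_k = alpha0 / sqrt(k + 1)."""
    x = np.asarray(x0, dtype=float)
    for k in range(steps):
        g = subgrad(x)
        x = x - (alpha0 / np.sqrt(k + 1)) * g
    return x

# f(x) = ||x||_1 is nonsmooth; sign(x) is a valid subgradient away from 0,
# and np.sign returns 0 at the kink, which is also a valid subgradient.
x_min = subgradient_method(np.sign, np.array([3.0, -2.0, 0.5]))
```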
- Hessian Based Smoothing Splines for Manifold Learning [0.228438857884398]
We propose a multidimensional smoothing spline algorithm in the context of manifold learning.
We generalize the bending energy penalty of thin-plate splines to a quadratic form on the Sobolev space of a flat manifold.
The existence and uniqueness of the solution are shown by applying the theory of reproducing kernel Hilbert spaces.
arXiv Detail & Related papers (2023-02-10T02:49:05Z)
- Kernel-based off-policy estimation without overlap: Instance optimality beyond semiparametric efficiency [53.90687548731265]
We study optimal procedures for estimating a linear functional based on observational data.
For any convex and symmetric function class $\mathcal{F}$, we derive a non-asymptotic local minimax bound on the mean-squared error.
arXiv Detail & Related papers (2023-01-16T02:57:37Z)
- Convergence rate of the (1+1)-evolution strategy on locally strongly convex functions with Lipschitz continuous gradient [10.31411804947731]
Evolution strategy (ES) is one of the promising classes of algorithms for black-box continuous optimization. In this study, an upper bound and a lower bound of the rate of linear convergence of the (1+1)-ES on locally $L$-strongly convex functions with $U$-Lipschitz continuous gradient are derived.
arXiv Detail & Related papers (2022-09-26T07:16:50Z)
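For reference, a sketch of the algorithm under analysis: the (1+1)-ES with a per-iteration 1/5-success step-size rule. This is a common textbook variant; the paper's rate analysis may assume different step-size control, and the parameters here are illustrative.

```python
import numpy as np

def one_plus_one_es(f, x0, sigma=1.0, steps=2000, alpha=1.5):
    """(1+1)-ES: one Gaussian offspring per step, kept only on improvement.
    Sigma grows by alpha on success and shrinks by alpha**(-1/4) on
    failure, which is neutral exactly at a 1/5 success rate."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(steps):
        y = x + sigma * np.random.randn(x.size)   # one Gaussian offspring
        fy = f(y)
        if fy < fx:                               # elitist (1+1) selection
            x, fx = y, fy
            sigma *= alpha
        else:
            sigma *= alpha ** -0.25
    return x, fx

# The sphere function is locally strongly convex with Lipschitz continuous
# gradient, matching the function class in the snippet above.
x_best, f_best = one_plus_one_es(lambda z: float(z @ z), np.ones(10), sigma=0.3)
```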