Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
- URL: http://arxiv.org/abs/2602.01685v1
- Date: Mon, 02 Feb 2026 05:56:16 GMT
- Title: Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
- Authors: Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-Chul Moon
- Abstract summary: We propose a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance. Our method outperforms KL- and $f$-divergence-based baselines.
- Score: 30.266966684932186
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are commonly aligned with human preferences using reinforcement learning from human feedback (RLHF). In this method, LLM policies are generally optimized through reward maximization with Kullback-Leibler (KL) divergence regularization toward the reference policy. However, KL and its $f$-divergence variants only compare token probabilities at identical indices, failing to capture semantic similarity. We propose Wasserstein Policy Regularization (WPR), a semantic-aware regularization for the RLHF framework based on the entropy-regularized Wasserstein distance, which incorporates the geometry of the token space. The dual formulation of the distance expresses the regularization as penalty terms applied to the reward via optimal dual variables, which yield a tractable objective compatible with standard RL algorithms. Empirically, our method outperforms KL- and $f$-divergence-based baselines, demonstrating the benefits of semantic-aware policy distances for alignment. Our code is available at https://github.com/aailab-kaist/WPR.
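To make the abstract's central point concrete: KL-type divergences compare probabilities token-index by token-index, while the entropy-regularized Wasserstein distance weighs how far probability mass must move in embedding space. The paper works with the dual formulation; the minimal numpy sketch below instead computes the primal value with plain Sinkhorn iterations for intuition. The toy vocabulary, embeddings, and function name are illustrative, not taken from the paper's code.

```python
import numpy as np

def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=200):
    """Entropy-regularized Wasserstein distance between two discrete
    distributions p, q over the same vocabulary, given a token-to-token
    cost matrix (e.g. pairwise embedding distances)."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):         # Sinkhorn fixed-point iterations
        v = q / (K.T @ u)
        u = p / (K @ v)
    transport = u[:, None] * K * v[None, :]
    return float(np.sum(transport * cost))

# Toy vocabulary of 3 tokens with 2-d "embeddings": tokens 0 and 1 are
# semantically close, token 2 is far away.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
cost = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)

policy = np.array([0.8, 0.1, 0.1])
ref    = np.array([0.1, 0.8, 0.1])   # mass shifted to a *nearby* token
far    = np.array([0.1, 0.1, 0.8])   # mass shifted to a *distant* token

# KL assigns both shifts the same divergence (the index permutation is
# symmetric); the Wasserstein cost distinguishes them.
print(sinkhorn_distance(policy, ref, cost), sinkhorn_distance(policy, far, cost))
```

Moving mass to a semantically close token yields a much smaller distance than moving it to a distant one, which is exactly the geometric sensitivity KL lacks.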
Related papers
- Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences. We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices. We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z) - Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for reinforcement learning with Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled divergence constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
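The contrast the summary draws can be sketched numerically. The snippet below is illustrative, not DPPO's actual objective: it compares the standard PPO clipped surrogate, whose gradient vanishes once the probability ratio leaves the clip band, with a hypothetical surrogate that replaces clipping by an explicit KL penalty that keeps growing as the policy drifts.

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """PPO clipped surrogate (loss = negative objective). Once the
    probability ratio leaves [1 - eps, 1 + eps], the clipped branch is
    constant, so the loss (and its gradient) stops responding."""
    return -np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv).mean()

def divergence_penalized_loss(ratio, adv, beta=0.05):
    """Illustrative alternative: an explicit divergence penalty.
    (ratio - 1) - log(ratio) is a non-negative, unbiased sample estimate
    of KL(pi_old || pi_new) on samples from pi_old; it grows smoothly
    with policy drift instead of flat-lining."""
    kl_est = (ratio - 1.0) - np.log(ratio)
    return -(ratio * adv).mean() + beta * kl_est.mean()

# Beyond the clip band, the clipped loss is identical for ratio 2 and 3:
print(ppo_clip_loss(np.array([2.0]), np.array([1.0])),
      ppo_clip_loss(np.array([3.0]), np.array([1.0])))
```

The clipped loss is flat outside the trust region, while the penalty term keeps discriminating between moderate and large deviations.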
arXiv Detail & Related papers (2026-02-04T18:59:04Z) - Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model [43.74350307533018]
We study policy alignment to preferences whose link function is unknown and of unrestricted complexity. We use first-order optimization suited to neural networks and batched data.
arXiv Detail & Related papers (2025-12-26T08:22:41Z) - KL-Regularized Reinforcement Learning is Designed to Mode Collapse [29.23421728376746]
We show that the choice of reverse/forward KL determines the family of optimal target distributions. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm.
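For context on the reverse-KL case, the standard closed form (a well-known result, stated here for illustration rather than taken from this paper) shows why mode collapse is built in:

```latex
% Objective:  max_pi  E_pi[r(x,y)] - beta * KL(pi || pi_ref)
% Optimal target under *reverse* KL: the reward-tilted reference
\pi^\ast(y \mid x)
  \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
        \exp\!\Bigl(\tfrac{1}{\beta}\, r(x, y)\Bigr).
```

As $\beta \to 0$ this target concentrates on the reward maximizer (mode-seeking), whereas a forward-KL objective induces mass-covering behavior, which is the distinction the paper analyzes.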
arXiv Detail & Related papers (2025-10-23T17:59:40Z) - Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games [53.447182734351]
We develop and analyze algorithms that provably achieve improved sample efficiency under reverse Kullback-Leibler (KL) regularization. We study both two-player zero-sum matrix games and Markov games: for matrix games, we propose OMG, an algorithm based on best response sampling with optimistic bonuses, and extend this idea to Markov games through the algorithm SOMG. Both algorithms achieve a logarithmic regret in $T$ that scales inversely with the KL regularization strength $\beta$, in addition to the standard $\widetilde{\mathcal{O}}(\sqrt{T})$ regret.
arXiv Detail & Related papers (2025-10-15T01:00:54Z) - On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning [59.11784194183928]
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). The Regularized Policy Gradient (RPG) view shows that the widely used $k_3$ penalty is exactly the unnormalized KL divergence. RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
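The $k_3$ penalty the summary mentions is the standard sample-based KL estimator used in GRPO/DAPO-style losses; the connection to KL can be checked directly. The sketch below (toy distributions, illustrative names) verifies the standard identity that the expectation of $k_3$ under the current policy equals $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$; the paper's "unnormalized KL" characterization is a refinement of this view.

```python
import numpy as np

def k3_penalty(pi, ref):
    """Per-token k3 penalty with r = pi_ref / pi_theta, computed on
    samples drawn from pi_theta. Non-negative since r - 1 >= log r."""
    r = ref / pi
    return (r - 1.0) - np.log(r)

pi_theta = np.array([0.7, 0.2, 0.1])   # toy current policy over 3 tokens
pi_ref   = np.array([0.4, 0.4, 0.2])   # toy reference policy

# Taking the exact expectation under pi_theta recovers KL(pi_theta || pi_ref):
expected_k3 = np.sum(pi_theta * k3_penalty(pi_theta, pi_ref))
exact_kl    = np.sum(pi_theta * np.log(pi_theta / pi_ref))
print(expected_k3, exact_kl)
```

The two printed values agree: the $(r-1)$ term has expectation zero, and the $-\log r$ term contributes exactly the KL.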
arXiv Detail & Related papers (2025-05-23T06:01:21Z) - What is the Alignment Objective of GRPO? [30.36318490634376]
We present a framework that enables us to characterise the stationary policies of the GRPO algorithm. The precise form of preference aggregation arises from the way the reward preference model is defined and from the penalty function. We provide explicit characterisations of the aggregate preference for binary questions, for groups of size two, and in the limit of large group size.
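For readers unfamiliar with GRPO, the group-relative advantage it builds on (the standard formulation from the GRPO literature, not this paper's notation) is a per-prompt standardization of sampled rewards:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize each sampled response's
    reward against the other responses drawn for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Group of size two with a binary reward, the setting the paper analyses
# explicitly: whenever the two rewards differ, the signal is +1 / -1.
print(grpo_advantages([1.0, 0.0]))
```

This group-size-two binary case is the simplest instance in which the aggregation behaviour the paper characterises can be read off directly.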
arXiv Detail & Related papers (2025-02-25T15:56:56Z) - Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [15.038210624870656]
Reward inference is a critical intermediate step in the reinforcement learning from human feedback (RLHF) pipeline. This paper develops two RLHF algorithms without reward inference, for general RL problems beyond bandits and deterministic MDPs, and for general preference models beyond the Bradley-Terry model.
arXiv Detail & Related papers (2024-09-25T22:20:11Z) - WARP: On the Benefits of Weight Averaged Rewarded Policies [66.95013068137115]
We introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP)
WARP merges policies in the weight space at three distinct stages.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
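The core operation behind WARP's "merging in weight space" can be sketched as plain parameter averaging. This is a simplification for illustration: WARP itself combines an exponential-moving-average anchor, spherical interpolation of task vectors, and interpolation toward the initialization, none of which is shown here.

```python
import numpy as np

def average_policies(policies, coeffs=None):
    """Linear averaging of policy parameter dicts in weight space:
    a uniform (or weighted) combination, parameter tensor by parameter
    tensor. All policies must share the same architecture/keys."""
    n = len(policies)
    coeffs = coeffs if coeffs is not None else [1.0 / n] * n
    return {k: sum(c * p[k] for c, p in zip(coeffs, policies))
            for k in policies[0]}

# Two toy "policies" with a single weight tensor each:
a = {"w": np.array([0.0, 2.0])}
b = {"w": np.array([2.0, 0.0])}
print(average_policies([a, b]))
```

Averaging in weight space (rather than ensembling outputs) only makes sense for models fine-tuned from a shared initialization, which is the regime weight-averaging alignment strategies operate in.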
arXiv Detail & Related papers (2024-06-24T16:24:34Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the commonly used analytical formula for the KL divergence between the best-of-$n$ policy and the reference policy is an upper bound on the actual KL divergence. We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
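The upper-bound claim can be checked numerically on a toy discrete example. The widely quoted analytical formula is $\log n - (n-1)/n$; for a distribution with atoms, the true KL of the best-of-$n$ policy falls below it, consistent with the summary. The distribution, rewards, and function name below are illustrative.

```python
import numpy as np

def best_of_n_policy(p, n):
    """Distribution of the highest-reward sample among n i.i.d. draws
    from p, assuming outcomes are indexed in increasing reward order
    with no reward ties across distinct outcomes."""
    F = np.cumsum(p)                              # CDF in reward order
    F_prev = np.concatenate(([0.0], F[:-1]))
    return F**n - F_prev**n                       # P(max falls on each atom)

p = np.array([0.5, 0.3, 0.2])   # reference policy, rewards increasing by index
n = 4
q = best_of_n_policy(p, n)

kl = float(np.sum(q * np.log(q / p)))
bound = np.log(n) - (n - 1) / n  # the widely quoted analytical formula
print(round(kl, 4), "<=", round(bound, 4))
```

For continuous (tie-free) distributions the formula holds with equality; atoms are what open the gap that makes it only an upper bound.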
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
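The double-sampling issue and its Fenchel-dual fix can be written out explicitly (a standard identity, shown here for illustration): the square of an expectation needs two independent samples for an unbiased gradient, but it admits a variational form that is linear in a single expectation.

```latex
% Variance decomposition: the (E[X])^2 term causes double sampling
\mathrm{Var}_{\pi}[X] \;=\; \mathbb{E}_{\pi}[X^2] - \bigl(\mathbb{E}_{\pi}[X]\bigr)^2,
\qquad
\bigl(\mathbb{E}_{\pi}[X]\bigr)^2 \;=\; \max_{\nu \in \mathbb{R}}
    \bigl( 2\nu\,\mathbb{E}_{\pi}[X] - \nu^2 \bigr).
% Substituting the dual form turns the variance into a single-expectation
% saddle-point objective:
\mathrm{Var}_{\pi}[X] \;=\; \min_{\nu \in \mathbb{R}}
    \mathbb{E}_{\pi}\bigl[(X - \nu)^2\bigr].
```

Each term in the dual objective is linear in one expectation, so unbiased gradients need only single samples, with the scalar $\nu$ optimized jointly.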
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.