Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
- URL: http://arxiv.org/abs/2505.23749v1
- Date: Thu, 29 May 2025 17:59:20 GMT
- Title: Distortion of AI Alignment: Does Preference Optimization Optimize for Preferences?
- Authors: Paul Gölz, Nika Haghtalab, Kunhe Yang
- Abstract summary: After pre-training, large language models are aligned with human preferences based on pairwise comparisons. We introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility, and the average utility of the learned policy.
- Score: 20.004349891563706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average -- a minimal requirement for pluralistic alignment. Drawing on social choice theory and modeling users' comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method's distortion: the worst-case ratio between the optimal achievable average utility, and the average utility of the learned policy. The notion of distortion helps draw sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax optimal distortion of $(\frac{1}{2} + o(1)) \cdot \beta$ (for the BT temperature $\beta$), robustly across utility distributions, distributions of comparison pairs, and permissible KL divergences from the reference policy. RLHF and DPO, by contrast, suffer $\geq (1 - o(1)) \cdot \beta$ distortion already without a KL constraint, and $e^{\Omega(\beta)}$ or even unbounded distortion in the full setting, depending on how comparison pairs are sampled.
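As a concrete illustration of the distortion measure, here is a minimal sketch in Python (single prompt, finite response set, no KL constraint to the reference policy; the `utilities` matrix and toy numbers are hypothetical, not from the paper). It computes the per-instance ratio between the best achievable average utility and a given policy's average utility; the paper's distortion is the worst case of this ratio over problem instances.

```python
import numpy as np

def average_utility(policy, utilities):
    """Average user utility obtained by a policy.

    policy:    shape (num_responses,), a distribution over candidate responses.
    utilities: shape (num_users, num_responses); utilities[i, y] is user i's
               utility for response y (users' pairwise comparisons would follow
               individual Bradley-Terry models over these utilities).
    """
    return float(policy @ utilities.mean(axis=0))

def per_instance_distortion(policy, utilities):
    """Ratio of the best achievable average utility (here simply the best
    single response, since no KL constraint is imposed) to the policy's
    average utility; >= 1, and lower is better."""
    best = utilities.mean(axis=0).max()
    return best / average_utility(policy, utilities)

# Toy instance: two user types with opposed preferences, three responses.
utilities = np.array([[1.0, 0.2, 0.0],
                      [0.0, 0.2, 1.0]])
uniform_policy = np.ones(3) / 3
print(per_instance_distortion(uniform_policy, utilities))  # 1.25 on this toy instance
```

With a KL budget relative to a reference policy, both the numerator and the evaluated policy would be restricted to the feasible set; the sketch omits that for brevity.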
Related papers
- On Monotonicity in AI Alignment [10.244128221542228]
This paper investigates the root causes of (non)monotonicity for a general comparison-based preference learning framework. Under mild assumptions, we prove that such methods still satisfy what we call local pairwise monotonicity. We also provide a bouquet of formalizations of monotonicity, and identify sufficient conditions for their guarantee, thereby providing a toolbox to evaluate how prone learning models are to monotonicity violations.
arXiv Detail & Related papers (2025-06-10T17:17:48Z) - Reverse Preference Optimization for Complex Instruction Following [61.39734201711077]
We propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect. RPO scales effectively across model sizes, with the 70B RPO model surpassing GPT-4o.
arXiv Detail & Related papers (2025-05-28T09:44:27Z) - KL Penalty Control via Perturbation for Direct Preference Optimization [53.67494512877768]
We propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Experimental results show that the simple criterion of $\varepsilon$-DPO for KL penalty relaxation significantly improves DPO compared to most existing direct alignment algorithms.
arXiv Detail & Related papers (2025-02-18T06:44:10Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human-annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment [16.230186347702737]
We propose Simultaneous Weighted Preference Optimization (SWEPO). SWEPO incorporates multiple responses per query and prioritizes those that deviate most from the average reward. We prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $\mathcal{O}(\tfrac{1}{\sqrt{k}})$.
arXiv Detail & Related papers (2024-12-05T21:50:22Z) - SePPO: Semi-Policy Preference Optimization for Diffusion Alignment [67.8738082040299]
We propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data.
We validate SePPO across both text-to-image and text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z) - Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs. Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z) - Provably Robust DPO: Aligning Language Models with Noisy Feedback [10.523790076060171]
We introduce a general framework for policy optimization in the presence of random preference flips.
We design a novel loss function that de-biases the effect of noise on average, making a policy trained by minimizing it robust to the noise.
Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO.
arXiv Detail & Related papers (2024-03-01T09:55:18Z) - Active Preference Optimization for Sample Efficient RLHF [27.772423917657626]
Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models with human preferences.
Current methods rely on uniformly sampling prompt-generation pairs from a dataset of prompt-generation pairs.
We develop an active-learning algorithm, $\texttt{APO}$, which enhances model alignment by querying preference data.
arXiv Detail & Related papers (2024-02-16T08:19:34Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the commonly used analytical formula for the KL divergence between the best-of-$n$ policy and the reference policy is an upper bound on the actual KL divergence (a toy numerical check of this bound appears at the end of this list). We also propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Lossy Compression with Distortion Constrained Optimization [14.45964083146559]
We show that the constrained optimization method of Rezende and Viola (2018) is more appropriate for training lossy compression models than a $\beta$-VAE.
We show that the method does manage to satisfy the constraint on a realistic image compression task, outperforms a constrained optimization method based on a hinge loss, and is more practical to use for model selection than a $\beta$-VAE.
arXiv Detail & Related papers (2020-05-08T14:27:01Z)
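As referenced in the best-of-$n$ entry above, here is a toy numerical check (illustrative numbers only, not from that paper) that the commonly cited analytical expression $\log n - \frac{n-1}{n}$ upper-bounds the exact KL divergence between a best-of-$n$ policy and its reference policy on a small discrete example.

```python
import numpy as np

# Reference policy over three responses, listed in increasing reward order,
# with all rewards distinct; best-of-n keeps the highest-reward draw out of n.
p = np.array([0.5, 0.3, 0.2])
n = 4

cdf = np.cumsum(p)
cdf_prev = np.concatenate(([0.0], cdf[:-1]))
best_of_n = cdf**n - cdf_prev**n          # exact best-of-n distribution

kl = float(np.sum(best_of_n * np.log(best_of_n / p)))
bound = np.log(n) - (n - 1) / n           # commonly cited analytical expression
print(f"KL = {kl:.3f}, bound = {bound:.3f}")   # ~0.56 <= ~0.64 on this example
```

The gap here comes from the discreteness of the toy reference distribution; for atomless reward distributions the expression is tight.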