Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
- URL: http://arxiv.org/abs/2410.08847v2
- Date: Mon, 14 Oct 2024 02:22:24 GMT
- Title: Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
- Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
- Abstract summary: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences.
Prior work has observed that the likelihood of preferred responses often decreases during training.
We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
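The abstract names the CHES score but does not spell out its formula. Below is a minimal, hedged Python sketch (not the authors' released code) of a CHES-style similarity for a preference pair: it sums the model's last-layer hidden states over each response's tokens and compares the two sums, centered by the preferred response's squared norm. The exact formula, the `gpt2` stand-in model, and the example strings are illustrative assumptions; see the paper for the precise definition.

```python
# Hedged sketch of a CHES-style score; the centering formula is an assumed
# reconstruction from the abstract's description, not the paper's exact definition.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper experiments with models such as Llama-3-8B-Instruct

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def summed_response_embeddings(prompt: str, response: str) -> torch.Tensor:
    """Sum the last-layer hidden states over the response tokens only."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, response_ids], dim=1)  # GPT-2 adds no special tokens
    hidden = model(full_ids).last_hidden_state[0]            # (seq_len, hidden_dim)
    return hidden[prompt_ids.shape[1]:].sum(dim=0)           # sum over response positions

def ches_style_score(prompt: str, preferred: str, dispreferred: str) -> float:
    """Higher score = more similar hidden embeddings = higher displacement risk (assumed form)."""
    h_pos = summed_response_embeddings(prompt, preferred)
    h_neg = summed_response_embeddings(prompt, dispreferred)
    return (h_neg @ h_pos - h_pos @ h_pos).item()

# Mirrors the abstract's No-vs-Never example: pairs scoring highest are the
# candidates to filter out before DPO training.
print(ches_style_score("Is it safe to share my password?", " No", " Never"))
```

Under this sketch, mitigating unintentional unalignment amounts to scoring every preference pair in the training set and dropping the top-scoring fraction before running DPO, matching the filtering strategy the abstract describes.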
Related papers
- Uncertainty-Penalized Direct Preference Optimization
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts whose chosen/rejected responses have high uncertainty (a hedged sketch of the attenuation idea follows below).
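As a rough illustration of the attenuation idea (a sketch under assumed details, not the paper's actual penalization scheme), the correction can be pictured as down-weighting each pair's DPO loss by its uncertainty:

```python
# Hedged sketch: vanilla DPO loss with each pair's contribution attenuated for
# uncertain samples. The (1 - uncertainty) weighting and the source of the
# `uncertainty` values are assumptions, not the paper's exact scheme.
import torch
import torch.nn.functional as F

def uncertainty_penalized_dpo_loss(policy_chosen_logps: torch.Tensor,
                                   policy_rejected_logps: torch.Tensor,
                                   ref_chosen_logps: torch.Tensor,
                                   ref_rejected_logps: torch.Tensor,
                                   uncertainty: torch.Tensor,  # in [0, 1], e.g. from a reward-model ensemble
                                   beta: float = 0.1) -> torch.Tensor:
    margin = (policy_chosen_logps - policy_rejected_logps) - \
             (ref_chosen_logps - ref_rejected_logps)
    per_pair = -F.logsigmoid(beta * margin)         # standard per-pair DPO loss
    return ((1.0 - uncertainty) * per_pair).mean()  # attenuates gradients of uncertain pairs
```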
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model alignment.
In this paper, we identify a common pitfall of margin-based methods: gradient entanglement between the chosen and rejected log-probabilities. We demystify the reasons behind the problematic behaviors this entanglement causes (a small numerical sketch follows below).
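The entanglement can be illustrated numerically: a margin-based loss such as DPO's depends on the chosen and rejected log-probabilities only through their difference, so their gradients are locked to equal magnitude and opposite sign, and the loss cannot push the chosen response up independently of pushing the rejected one down. A small sketch (DPO's logistic margin loss and the numbers are illustrative assumptions):

```python
# Numerical illustration of gradient entanglement in a margin-based loss.
import torch
import torch.nn.functional as F

logp_chosen = torch.tensor(-5.0, requires_grad=True)    # log pi(y+ | x), illustrative
logp_rejected = torch.tensor(-6.0, requires_grad=True)  # log pi(y- | x), illustrative

beta = 0.1
loss = -F.logsigmoid(beta * (logp_chosen - logp_rejected))  # DPO-style margin loss
loss.backward()

# The loss depends only on the margin, so the two gradients are entangled:
# equal magnitude, opposite sign (about -0.0475 and +0.0475 here).
print(logp_chosen.grad, logp_rejected.grad)
```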
arXiv Detail & Related papers (2024-10-17T17:52:01Z)
- Robust Preference Optimization through Reward Model Distillation
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose reward model distillation to obtain a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Rejection via Learning Density Ratios
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- Contextual Linear Optimization with Bandit Feedback
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients.
We study a class of offline learning algorithms for CLO with bandit feedback.
We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate.
arXiv Detail & Related papers (2024-05-26T13:27:27Z)
- Calibrated Selective Classification
We develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties.
We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model.
We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
arXiv Detail & Related papers (2022-08-25T13:31:09Z)
- Selective Regression Under Fairness Criteria
In some cases, the performance of a minority group can decrease as we reduce coverage.
We show that such an unwanted behavior can be avoided if we can construct features satisfying the sufficiency criterion.
arXiv Detail & Related papers (2021-10-28T19:05:12Z)
- Probabilistic and Variational Recommendation Denoising
Learning from implicit feedback is one of the most common cases in the application of recommender systems.
We propose probabilistic and variational recommendation denoising for implicit feedback.
We employ the proposed DPI and DVAE on four state-of-the-art recommendation models and conduct experiments on three datasets.
arXiv Detail & Related papers (2021-05-20T08:59:44Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
- Sampler Design for Implicit Feedback Data by Noisy-label Robust Learning
We design an adaptive sampler based on noisy-label robust learning for implicit feedback data.
We predict users' preferences with the model and learn it by maximizing the likelihood of observed data labels.
We then consider the risk of these noisy labels and propose a Noisy-label Robust BPR.
arXiv Detail & Related papers (2020-06-28T05:31:53Z)