Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
- URL: http://arxiv.org/abs/2410.08847v2
- Date: Mon, 14 Oct 2024 02:22:24 GMT
- Title: Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
- Authors: Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin
- Abstract summary: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences.
Prior work has observed that the likelihood of preferred responses often decreases during training.
We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Although these methods are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we term likelihood displacement. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning. As a simple example, training a model to prefer $\texttt{No}$ over $\texttt{Never}$ can sharply increase the probability of $\texttt{Yes}$. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement can unintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by a centered hidden embedding similarity (CHES) score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.
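The abstract names the CHES score but does not spell out its formula. Below is a minimal, hedged Python sketch (not the authors' released code) of a CHES-style similarity for a preference pair: it sums the model's last-layer hidden states over each response's tokens and compares the two sums, centered by the preferred response's squared norm. The exact formula, the `gpt2` stand-in model, and the example strings are illustrative assumptions; see the paper for the precise definition.

```python
# Hedged sketch of a CHES-style score; the centering formula is an assumed
# reconstruction from the abstract's description, not the paper's exact definition.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper experiments with models such as Llama-3-8B-Instruct

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def summed_response_embeddings(prompt: str, response: str) -> torch.Tensor:
    """Sum the last-layer hidden states over the response tokens only."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = tokenizer(response, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, response_ids], dim=1)  # GPT-2 adds no special tokens
    hidden = model(full_ids).last_hidden_state[0]            # (seq_len, hidden_dim)
    return hidden[prompt_ids.shape[1]:].sum(dim=0)           # sum over response positions

def ches_style_score(prompt: str, preferred: str, dispreferred: str) -> float:
    """Higher score = more similar hidden embeddings = higher displacement risk (assumed form)."""
    h_pos = summed_response_embeddings(prompt, preferred)
    h_neg = summed_response_embeddings(prompt, dispreferred)
    return (h_neg @ h_pos - h_pos @ h_pos).item()

# Mirrors the abstract's No-vs-Never example: pairs scoring highest are the
# candidates to filter out before DPO training.
print(ches_style_score("Is it safe to share my password?", " No", " Never"))
```

Under this sketch, mitigating unintentional unalignment amounts to scoring every preference pair in the training set and dropping the top-scoring fraction before running DPO, matching the filtering strategy the abstract describes.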
Related papers
- Uncertainty-Penalized Direct Preference Optimization
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts whose chosen/rejected responses have high uncertainty (a hedged sketch of the attenuation idea follows below).
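As a rough illustration of the attenuation idea (a sketch under assumed details, not the paper's actual penalization scheme), the correction can be pictured as down-weighting each pair's DPO loss by its uncertainty:

```python
# Hedged sketch: vanilla DPO loss with each pair's contribution attenuated for
# uncertain samples. The (1 - uncertainty) weighting and the source of the
# `uncertainty` values are assumptions, not the paper's exact scheme.
import torch
import torch.nn.functional as F

def uncertainty_penalized_dpo_loss(policy_chosen_logps: torch.Tensor,
                                   policy_rejected_logps: torch.Tensor,
                                   ref_chosen_logps: torch.Tensor,
                                   ref_rejected_logps: torch.Tensor,
                                   uncertainty: torch.Tensor,  # in [0, 1], e.g. from a reward-model ensemble
                                   beta: float = 0.1) -> torch.Tensor:
    margin = (policy_chosen_logps - policy_rejected_logps) - \
             (ref_chosen_logps - ref_rejected_logps)
    per_pair = -F.logsigmoid(beta * margin)         # standard per-pair DPO loss
    return ((1.0 - uncertainty) * per_pair).mean()  # attenuates gradients of uncertain pairs
```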
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model alignment.
In this paper, we identify a common pitfall of margin-based methods: gradient entanglement between the chosen and rejected log-probabilities. We demystify the reasons behind the problematic behaviors this entanglement causes (a small numerical sketch follows below).
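The entanglement can be illustrated numerically: a margin-based loss such as DPO's depends on the chosen and rejected log-probabilities only through their difference, so their gradients are locked to equal magnitude and opposite sign, and the loss cannot push the chosen response up independently of pushing the rejected one down. A small sketch (DPO's logistic margin loss and the numbers are illustrative assumptions):

```python
# Numerical illustration of gradient entanglement in a margin-based loss.
import torch
import torch.nn.functional as F

logp_chosen = torch.tensor(-5.0, requires_grad=True)    # log pi(y+ | x), illustrative
logp_rejected = torch.tensor(-6.0, requires_grad=True)  # log pi(y- | x), illustrative

beta = 0.1
loss = -F.logsigmoid(beta * (logp_chosen - logp_rejected))  # DPO-style margin loss
loss.backward()

# The loss depends only on the margin, so the two gradients are entangled:
# equal magnitude, opposite sign (about -0.0475 and +0.0475 here).
print(logp_chosen.grad, logp_rejected.grad)
```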
arXiv Detail & Related papers (2024-10-17T17:52:01Z)
- Robust Preference Optimization through Reward Model Distillation
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose reward model distillation to obtain a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Rejection via Learning Density Ratios
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
- Contextual Linear Optimization with Bandit Feedback
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients.
We study a class of offline learning algorithms for CLO with bandit feedback.
We show a fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of the optimization estimate.
arXiv Detail & Related papers (2024-05-26T13:27:27Z)
- Calibrated Selective Classification
We develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties.
We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model.
We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.
arXiv Detail & Related papers (2022-08-25T13:31:09Z)
- Selective Regression Under Fairness Criteria
In some cases, the performance of a minority group can decrease as we reduce coverage.
We show that such an unwanted behavior can be avoided if we can construct features satisfying the sufficiency criterion.
arXiv Detail & Related papers (2021-10-28T19:05:12Z)
- Probabilistic and Variational Recommendation Denoising
Learning from implicit feedback is one of the most common cases in the application of recommender systems.
We propose probabilistic and variational recommendation denoising for implicit feedback.
We employ the proposed DPI and DVAE on four state-of-the-art recommendation models and conduct experiments on three datasets.
arXiv Detail & Related papers (2021-05-20T08:59:44Z)
- Deconfounding Scores: Feature Representations for Causal Effect Estimation with Weak Overlap
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z)
- Sampler Design for Implicit Feedback Data by Noisy-label Robust Learning
We design an adaptive sampler based on noisy-label robust learning for implicit feedback data.
We predict users' preferences with the model and learn it by maximizing the likelihood of observed data labels.
We then consider the risk of these noisy labels and propose a Noisy-label Robust BPR.
arXiv Detail & Related papers (2020-06-28T05:31:53Z)