Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
- URL: http://arxiv.org/abs/2407.18676v1
- Date: Fri, 26 Jul 2024 11:38:18 GMT
- Title: Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
- Authors: Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
- Abstract summary: Current preference optimization algorithms do not account for temporal preference drift in their modeling.
We propose Non-Stationary Direct Preference Optimisation (NS-DPO).
We show that NS-DPO fine-tuned LLMs remain robust under non-stationarity.
- Score: 21.16837827950466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning from human feedback (RLHF) aligns Large Language Models (LLMs) with human preferences. However, these preferences can often change over time due to external factors (e.g., environmental change and societal influence). Consequently, what was wrong then might be right now. Current preference optimization algorithms do not account for temporal preference drift in their modeling, which can lead to severe misalignment. To address this limitation, we use a Dynamic Bradley-Terry model that captures preferences via time-dependent reward functions, and propose Non-Stationary Direct Preference Optimisation (NS-DPO). By introducing a discount parameter in the loss function, NS-DPO applies exponential weighting, which proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in the offline setting, providing upper bounds on the estimation error caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs in scenarios with drifting preferences. By simulating preference drift using renowned reward models and modifying popular LLM datasets accordingly, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
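The abstract describes NS-DPO's core mechanism concretely enough to sketch: a single discount parameter exponentially down-weights older preference pairs inside the DPO loss. The following is a minimal illustration under that reading; the function name, the `gamma` and `timestamps` arguments, and the weight normalisation are assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of an exponentially time-weighted DPO loss, assuming each
# preference pair carries an integer timestamp. All names and the weight
# normalisation are illustrative assumptions, not the NS-DPO reference code.
import torch
import torch.nn.functional as F

def ns_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                timestamps, beta=0.1, gamma=0.95):
    # Implicit Bradley-Terry reward margin, as in standard DPO.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    margin = chosen - rejected

    # Exponential discount: older pairs receive smaller weights, so learning
    # concentrates on the most time-relevant datapoints.
    weights = gamma ** (timestamps.max() - timestamps).float()

    per_pair_loss = -F.logsigmoid(margin)
    return (weights * per_pair_loss).sum() / weights.sum()
```

With gamma = 1 every pair is weighted equally and the objective reduces to vanilla DPO; smaller values of gamma shift the effective sample size toward the most recent preference data.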
Related papers
- Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective [22.248134630764497]
We propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter.
Our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences; a minimal token-level sketch of this decay idea appears after this list.
arXiv Detail & Related papers (2025-02-20T07:53:11Z)
- SPRec: Self-Play to Debias LLM-based Recommendation [23.875509546540904]
Large language models (LLMs) have attracted significant attention in recommendation systems.
We propose SPRec, a novel self-play framework designed to mitigate over-recommendation and improve fairness without requiring additional data or manual intervention.
arXiv Detail & Related papers (2024-12-12T12:53:30Z)
- Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.
For preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions using pairwise comparison signals.
For training regularization, we observe that preference optimization tends to converge better when the LLM's predicted likelihood of preferred samples is slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization corrects the loss by attenuating its gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts with high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs.
We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this approach and propose distillation to obtain a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs fine-tuned with MRPO generalize better across various preference datasets, whether preference data is scarce or abundant.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO).
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
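For contrast with the datapoint-level discounting sketched above, the "Earlier Tokens Contribute More" entry describes a token-level temporal decay controlled by a gamma parameter. A minimal reading of that idea is sketched below: per-token policy/reference log-probability differences are weighted by gamma raised to the token position, so earlier tokens carry more weight. Function names, tensor shapes, and the exact weighting scheme are assumptions for illustration, not that paper's implementation.

```python
# Minimal sketch of token-position decay inside a DPO-style objective,
# assuming per-token log-probabilities of shape (batch, seq_len) with equal
# lengths and no padding. Names and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def decayed_logratio(policy_token_logps, ref_token_logps, gamma=0.98):
    # Weight position t by gamma ** t, so earlier tokens contribute more
    # to the sequence-level log-ratio.
    positions = torch.arange(policy_token_logps.shape[-1], dtype=torch.float32)
    weights = gamma ** positions
    return ((policy_token_logps - ref_token_logps) * weights).sum(dim=-1)

def temporal_decay_dpo_loss(policy_chosen, ref_chosen,
                            policy_rejected, ref_rejected,
                            beta=0.1, gamma=0.98):
    margin = beta * (decayed_logratio(policy_chosen, ref_chosen, gamma)
                     - decayed_logratio(policy_rejected, ref_rejected, gamma))
    return -F.logsigmoid(margin).mean()
```

Setting gamma = 1 recovers the plain sum of token log-ratios, i.e. the standard DPO margin.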
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.