As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss
- URL: http://arxiv.org/abs/2410.04834v2
- Date: Fri, 25 Oct 2024 07:41:45 GMT
- Title: As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss
- Authors: Xin Mao, Feng-Lin Li, Huimin Xu, Wei Zhang, Wang Chen, Anh Tuan Luu
- Abstract summary: We propose a novel alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization.
Our proposed BNF loss eliminates the need for pairwise contrastive losses.
We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks.
- Score: 26.860139372768092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online sampling. Despite these benefits, DPO and its variants remain sensitive to hyper-parameters and prone to instability, particularly on mathematical datasets. We argue that these issues arise from the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss function. To address this, we propose a novel LLM alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization. Our proposed BNF loss eliminates the need for pairwise contrastive losses and does not require any extra tunable hyper-parameters or pairwise preference data, streamlining the alignment pipeline to be as simple as supervised fine-tuning. We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks. The experimental results show that BNF achieves comparable performance to the best methods on QA benchmarks, while its performance decrease on the four reasoning benchmarks is significantly lower compared to the best methods, thus striking a better balance between value alignment and reasoning ability. In addition, we further validate the performance of BNF on non-pairwise datasets, and conduct in-depth analysis of log-likelihood and logit shifts across different preference optimization methods.
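For orientation, below is a minimal sketch (not the paper's BNF loss, which is defined in the paper itself) contrasting the pairwise DPO objective described in the abstract with the plain supervised fine-tuning loss whose simplicity BNF aims to match; the function names and the beta default are illustrative assumptions. One way to read the abstract's "unidirectional likelihood-derivative negative feedback": the derivative of -log p with respect to p is -1/p, so the corrective signal grows only as the likelihood shrinks, with no comparable counter-pressure in the other direction.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss: requires chosen/rejected response pairs,
    a frozen reference model, and a tunable temperature beta."""
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

def sft_loss(logits, target_ids):
    """Plain supervised fine-tuning: token-level cross-entropy on a single
    response, with no pairs and no extra tunable hyper-parameter."""
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
```

Per the abstract, BNF keeps this SFT-like simplicity (no pairwise contrastive term, no pairwise preference data, no extra tunable hyper-parameters) while replacing the plain log-likelihood with a loss that provides negative feedback in both directions.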
Related papers
- Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations.
In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%.
DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z)
- SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment [16.230186347702737]
We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel extension of Direct Preference Optimization (DPO).
SWEPO accommodates multiple dynamically chosen positive and negative responses for each query.
We show that simultaneously considering multiple preferences reduces alignment bias, resulting in more robust alignment.
Empirical validation on the UltraFeedback dataset establishes SWEPO as state-of-the-art, with superior performance in downstream evaluations.
arXiv Detail & Related papers (2024-12-05T21:50:22Z)
- Direct Preference Optimization Using Sparse Feature-Level Constraints [47.15096507230884]
Feature-level constrained Preference Optimization is a novel method designed to simplify the alignment process while ensuring stability.
Our approach gains efficiency by using sparse features activated in a well-trained sparse autoencoder, and maintains alignment quality through a sequential KL-divergence constraint.
arXiv Detail & Related papers (2024-11-12T07:54:13Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss, attenuating the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets.
ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data faster than it increases the probability of generating preferred data.
Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a sketch of an objective of this shape appears after this list).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft labels, self-generated responses, and the logits of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
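As a footnote to the "Provably Mitigating Overoptimization in RLHF" entry above, which combines a preference optimization loss with a supervised learning loss, here is a minimal sketch of an objective of that general shape; the exact formulation, beta, and the mixing weight sft_coef are illustrative assumptions rather than values from that paper.

```python
import torch.nn.functional as F

def dpo_plus_sft_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      beta=0.1, sft_coef=1.0):
    """Pairwise preference loss plus an SFT (log-likelihood) term on the
    chosen response. beta and sft_coef are illustrative, not taken from the paper."""
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    preference_loss = -F.logsigmoid(margin).mean()
    # Supervised-learning regularizer: maximize log-likelihood of the chosen response.
    sft_term = -policy_chosen_logps.mean()
    return preference_loss + sft_coef * sft_term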