Orthogonal Finetuning for Direct Preference Optimization
- URL: http://arxiv.org/abs/2409.14836v2
- Date: Tue, 24 Sep 2024 03:22:15 GMT
- Title: Orthogonal Finetuning for Direct Preference Optimization
- Authors: Chenxu Yang, Ruipeng Jia, Naibin Gu, Zheng Lin, Siyuan Chen, Chao Pang, Weichong Yin, Yu Sun, Hua Wu, Weiping Wang,
- Abstract summary: We introduce finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method.
RoPO conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant.
Our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters.
- Score: 46.38508475165443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
Related papers
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z) - Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models [73.88009808326387]
We propose a novel spectrum-aware adaptation framework for generative models.
Our method adjusts both singular values and their basis vectors of pretrained weights.
We introduce Spectral Ortho Decomposition Adaptation (SODA), which balances computational efficiency and representation capacity.
arXiv Detail & Related papers (2024-05-31T17:43:35Z) - Spectral Adapter: Fine-Tuning in Spectral Space [45.72323731094864]
We study the enhancement of current PEFT methods by incorporating the spectral information of pretrained weight matrices into the fine-tuning procedure.
We show through extensive experiments that the proposed fine-tuning model enables better parameter efficiency and tuning performance as well as benefits multi-adapter fusion.
arXiv Detail & Related papers (2024-05-22T19:36:55Z) - Generalized Preference Optimization: A Unified Approach to Offline Alignment [54.97015778517253]
We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions.
GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases.
Our results present new algorithmic toolkits and empirical insights to alignment practitioners.
arXiv Detail & Related papers (2024-02-08T15:33:09Z) - Sample-efficient Iterative Lower Bound Optimization of Deep Reactive
Policies for Planning in Continuous MDPs [27.41101006357176]
In this work, we take a minorization-maximization perspective to iteratively optimize the.
w.r.t. a locally tight lower-bounded objective.
This novel formulation of learning as iterative lower bound optimization (ILBO) is particularly appealing because (i) each step is structurally easier to optimize than the overall objective.
Empirical evaluation confirms that ILBO is significantly more sample-efficient than the state-of-the-art planner.
arXiv Detail & Related papers (2022-03-23T19:06:16Z) - Variational Refinement for Importance Sampling Using the Forward
Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
arXiv Detail & Related papers (2021-06-30T11:00:24Z) - The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods [12.93796690939018]
We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(frac1sqrtt)$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
arXiv Detail & Related papers (2021-02-15T02:57:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.