AlphaPO - Reward shape matters for LLM alignment
- URL: http://arxiv.org/abs/2501.03884v2
- Date: Fri, 14 Feb 2025 23:35:48 GMT
- Title: AlphaPO - Reward shape matters for LLM alignment
- Authors: Aman Gupta, Shao Tang, Qingquan Song, Sirou Zhu, Jiwoo Hong, Ankan Saha, Viral Gupta, Noah Lee, Eunki Kim, Siyu Zhu, Parag Agarwal, Natesh Pillai, S. Sathiya Keerthi
- Abstract summary: We introduce AlphaPO, a new DAA that helps change the shape of the reward function beyond the standard log reward.
Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance.
- Score: 8.688476316386176
- License:
- Abstract: Reinforcement Learning with Human Feedback (RLHF) and its variants have made huge strides toward the effective alignment of large language models (LLMs) to follow instructions and reflect human values. More recently, Direct Alignment Algorithms (DAAs) have emerged in which the reward modeling stage of RLHF is skipped by characterizing the reward directly as a function of the policy being learned. Some popular examples of DAAs include Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO). These methods often suffer from likelihood displacement, a phenomenon in which the probability of preferred responses is undesirably reduced. In this paper, we argue that, for DAAs, the reward (function) shape matters. We introduce \textbf{AlphaPO}, a new DAA method that leverages an $\alpha$-parameter to help change the shape of the reward function beyond the standard log reward. AlphaPO helps maintain fine-grained control over likelihood displacement and over-optimization. Compared to SimPO, one of the best-performing DAAs, AlphaPO leads to about 7\% to 10\% relative improvement in alignment performance for the instruct versions of Mistral-7B and Llama3-8B while achieving 15\% to 50\% relative improvement over DPO on the same models. The analysis and results presented highlight the importance of the reward shape, and how one can systematically change it to affect training dynamics, as well as improve alignment performance.
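To make the abstract's central idea concrete, the minimal sketch below implements a SimPO-style pairwise preference loss whose length-normalized log-likelihood reward is reshaped by an $\alpha$-parameter. The specific transform (a Box-Cox-style exponential family that recovers the plain log reward as $\alpha \to 0$), the function names, and the hyperparameter defaults are illustrative assumptions based only on the abstract, not the paper's reference implementation.

```python
# Illustrative sketch only: the exact AlphaPO reward form is an assumption here.
import torch
import torch.nn.functional as F

def shaped_reward(avg_logprob: torch.Tensor, alpha: float) -> torch.Tensor:
    """Reshape the length-normalized log-likelihood of a response.

    alpha == 0 keeps the standard log reward (as in SimPO); a nonzero alpha
    bends the reward curve (assumed Box-Cox-style transform for illustration).
    """
    if alpha == 0.0:
        return avg_logprob
    return (torch.exp(alpha * avg_logprob) - 1.0) / alpha

def alpha_shaped_pairwise_loss(avg_logprob_chosen: torch.Tensor,
                               avg_logprob_rejected: torch.Tensor,
                               alpha: float = 0.25,
                               beta: float = 2.0,
                               margin: float = 0.5) -> torch.Tensor:
    """Bradley-Terry-style loss on the reshaped rewards.

    `beta` scales the reward difference and `margin` is a target reward margin,
    mirroring SimPO's hyperparameters; the default values are placeholders.
    """
    r_chosen = shaped_reward(avg_logprob_chosen, alpha)
    r_rejected = shaped_reward(avg_logprob_rejected, alpha)
    return -F.logsigmoid(beta * (r_chosen - r_rejected) - margin).mean()

# Usage with dummy per-sequence average log-probabilities:
chosen = torch.tensor([-1.2, -0.8])
rejected = torch.tensor([-2.5, -1.9])
loss = alpha_shaped_pairwise_loss(chosen, rejected, alpha=0.25)
```

Because the transform reduces to the average log-probability as $\alpha \to 0$, sweeping $\alpha$ interpolates between the standard log reward and more sharply curved reward shapes, which is the kind of fine-grained control over likelihood displacement and over-optimization that the abstract describes.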
Related papers
- Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective [22.248134630764497]
We propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a parameter $\gamma$.
Our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences.
arXiv Detail & Related papers (2025-02-20T07:53:11Z) - FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings [40.605411087380226]
We introduce FocalPO, a DPO variant that prioritizes enhancing the model's understanding of pairs that it can already rank correctly.
Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss.
arXiv Detail & Related papers (2025-01-11T21:41:27Z) - $\alpha$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z) - Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the fine-tuning step of large-scale language models (LLMs) to better align pretrained LLMs with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of $\beta$ in DPO, describe how its role differs between the original RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z) - Bootstrapping Language Models with DPO Implicit Rewards [45.68366127605774]
Direct Preference Optimization (DPO) has greatly simplified the alignment process relative to past work on reinforcement learning from human feedback, and it implicitly defines a reward model through the learned policy.
In this work, we make the novel observation that this implicit reward model can itself be used in a bootstrapping fashion to further align the LLM.
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance.
arXiv Detail & Related papers (2024-06-14T06:57:18Z) - Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [50.808123629394245]
Direct Alignment Algorithms (DAAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline.
This work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
arXiv Detail & Related papers (2024-06-05T03:41:37Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preference.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training (a minimal weight-extrapolation sketch appears after this list).
arXiv Detail & Related papers (2024-04-25T17:39:50Z)
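The ExPO entry above describes weak-to-strong extrapolation only at a high level. The sketch below is an assumed rendering of weight-space extrapolation from a weaker checkpoint (e.g., SFT) through an aligned checkpoint (e.g., DPO/RLHF); the extrapolation formula, the function name `extrapolate_weights`, and the default `alpha` are illustrative assumptions rather than the authors' code.

```python
# Assumed weight-space extrapolation sketch (not the ExPO reference code).
import torch

def extrapolate_weights(weak_state: dict, aligned_state: dict,
                        alpha: float = 0.3) -> dict:
    """Move each parameter past the aligned model along (aligned - weak).

    Assumes both state dicts come from the same architecture and contain
    floating-point parameters.
    """
    out = {}
    for name, w_aligned in aligned_state.items():
        w_weak = weak_state[name]
        # theta_expo = theta_aligned + alpha * (theta_aligned - theta_weak)
        out[name] = w_aligned + alpha * (w_aligned - w_weak)
    return out

# Usage:
#   expo_state = extrapolate_weights(sft_model.state_dict(),
#                                    dpo_model.state_dict(), alpha=0.3)
#   dpo_model.load_state_dict(expo_state)
```

If the aligned checkpoint lies between the weak checkpoint and a better-aligned optimum in weight space, extrapolating along that direction amplifies the learned alignment signal, matching the intuition stated in the ExPO summary above.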
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.