FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
- URL: http://arxiv.org/abs/2501.06645v1
- Date: Sat, 11 Jan 2025 21:41:27 GMT
- Title: FocalPO: Enhancing Preference Optimizing by Focusing on Correct Preference Rankings
- Authors: Tong Liu, Xiao Yu, Wenxuan Zhou, Jindong Gu, Volker Tresp
- Abstract summary: We introduce FocalPO, a DPO variant that prioritizes enhancing the model's understanding of pairs that it can already rank correctly.
Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale DPO loss.
- Score: 40.605411087380226
- License:
- Abstract: Efficient preference optimization algorithms such as Direct Preference Optimization (DPO) have become a popular approach for aligning large language models (LLMs) with human preferences. These algorithms implicitly treat the LLM as a reward model and focus on training it to correct misranked preference pairs. However, recent work (Chen et al., 2024) empirically finds that DPO training rarely improves these misranked preference pairs, despite its gradient emphasizing these cases. We introduce FocalPO, a DPO variant that instead down-weights misranked preference pairs and prioritizes enhancing the model's understanding of pairs that it can already rank correctly. Inspired by Focal Loss used in vision tasks, FocalPO achieves this by adding a modulating factor to dynamically scale the DPO loss. Our experiments demonstrate that FocalPO surpasses DPO and its variants on popular benchmarks like Alpaca Eval 2.0 using Mistral-Base-7B and Llama-3-Instruct-8B. Additionally, we empirically show how FocalPO affects training on correctly and incorrectly ranked sample groups, further underscoring its effectiveness.
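The abstract describes the method only at the level of a modulating factor that scales the DPO loss; it does not spell out the exact functional form here. The PyTorch sketch below illustrates one plausible reading of that description: the standard DPO loss -log σ(β·margin) is weighted by p^γ, where p = σ(β·margin) is the model's implicit probability of ranking the pair correctly, so misranked pairs (p < 0.5) are down-weighted. The function name `focal_dpo_loss`, the p^γ form of the factor, the detached weight, and the default β and γ values are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def focal_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, gamma=2.0):
    """Focal-style DPO loss sketch (illustrative, not the paper's exact form).

    All inputs are per-example summed log-probabilities of the chosen and
    rejected responses under the policy and the frozen reference model,
    each of shape (batch,).
    """
    # Implicit DPO reward margin:
    # beta * [(log pi - log pi_ref) on chosen - (log pi - log pi_ref) on rejected]
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))

    # p is the model's implicit probability of ranking the pair correctly.
    p = torch.sigmoid(margin)

    # Standard DPO loss is -log sigmoid(margin).
    dpo_loss = -F.logsigmoid(margin)

    # Focal-style modulating factor: small when the pair is misranked
    # (p < 0.5), close to 1 when the pair is already ranked correctly.
    # Detaching it treats the factor as a pure per-example weight; whether
    # gradients should flow through it is a choice not taken from the paper.
    modulating_factor = p.detach() ** gamma

    return (modulating_factor * dpo_loss).mean()
```

With gamma = 0 the factor is identically 1 and the sketch reduces to the standard DPO objective, which serves as a sanity check; increasing gamma shifts more of the effective gradient mass onto pairs the model already ranks correctly, matching the behavior the abstract attributes to FocalPO.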
Related papers
- Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective [22.248134630764497]
We propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter.
Our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences.
arXiv Detail & Related papers (2025-02-20T07:53:11Z)
- AlphaPO - Reward shape matters for LLM alignment [8.688476316386176]
We introduce AlphaPO, a new direct alignment algorithm (DAA) that helps change the shape of the reward function beyond the standard log reward.
Compared to SimPO, one of the best performing DAAs, AlphaPO leads to about 7% to 10% relative improvement in alignment performance.
arXiv Detail & Related papers (2025-01-07T15:46:42Z)
- Accelerated Preference Optimization for Large Language Model Alignment [60.22606527763201]
Reinforcement Learning from Human Feedback (RLHF) has emerged as a pivotal tool for aligning large language models (LLMs) with human preferences.
Direct Preference Optimization (DPO) formulates RLHF as a policy optimization problem without explicitly estimating the reward function.
We propose a general Accelerated Preference Optimization (APO) framework, which unifies many existing preference optimization algorithms.
arXiv Detail & Related papers (2024-10-08T18:51:01Z)
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.
TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
- Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the fine-tuning step of large-scale language models (LLMs) to better align pretrained LLMs with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of β in DPO, examine how its role differs between the RL algorithm and DPO, and discuss the potential shortcomings introduced by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z)
- Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization [34.29965046863887]
Triple Preference Optimization (TPO) is a new preference learning method designed to enhance both reasoning and instruction-following abilities.
TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes.
arXiv Detail & Related papers (2024-05-26T20:18:11Z)
- MallowsPO: Fine-Tune Your LLM with Preference Dispersions [9.697663437292848]
Direct Preference Optimization (DPO) has emerged as a popular approach to improve reinforcement learning with human feedback.
Inspired by Mallows' theory of preference ranking, we develop a new approach in this paper, MallowsPO.
A distinct feature of this approach is a dispersion index, which reflects the dispersion of human preferences across prompts.
arXiv Detail & Related papers (2024-05-23T18:01:11Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts.
RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)