$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
- URL: http://arxiv.org/abs/2410.21662v2
- Date: Sun, 16 Feb 2025 20:19:36 GMT
- Title: $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization
- Authors: Jiaqi Han, Mingjian Jiang, Yuxuan Song, Stefano Ermon, Minkai Xu,
- Abstract summary: $f$-PO is a novel framework that generalizes and extends existing approaches.
We conduct experiments on state-of-the-art language models using benchmark datasets.
- Score: 54.94545757220999
- License:
- Abstract: Preference optimization has made significant progress recently, with numerous methods developed to align language models with human preferences. This paper introduces $f$-divergence Preference Optimization ($f$-PO), a novel framework that generalizes and extends existing approaches. $f$-PO minimizes $f$-divergences between the optimized policy and the optimal policy, encompassing a broad family of alignment methods using various divergences. Our approach unifies previous algorithms like DPO and EXO, while offering new variants through different choices of $f$-divergences. We provide theoretical analysis of $f$-PO's properties and conduct extensive experiments on state-of-the-art language models using benchmark datasets. Results demonstrate $f$-PO's effectiveness across various tasks, achieving superior performance compared to existing methods on popular benchmarks such as AlpacaEval 2, Arena-Hard, MT-Bench, and Open LLM Leaderboard v2. Additionally, we present ablation studies exploring the impact of different $f$-divergences, offering insights into the trade-offs between regularization and performance in offline preference optimization. Our work contributes both practical algorithms and theoretical understanding to the field of language model alignment. Code is available at https://github.com/MinkaiXu/fPO.
Related papers
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - AIPO: Improving Training Objective for Iterative Preference Optimization [34.24211649396053]
We study iterative preference optimization with synthetic data.
We propose our training objective for iterative preference optimization, namely Agreement-aware Iterative Preference Optimization (AIPO)
arXiv Detail & Related papers (2024-09-13T14:03:49Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$chi2$-Preference Optimization ($chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization.
$chi$PO implements the principle of pessimism in the face of uncertainty via regularization.
$chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z) - $i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations.
We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization.
We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z) - Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO derived based on the optimal solution of problem leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z) - qPOTS: Efficient batch multiobjective Bayesian optimization via Pareto optimal Thompson sampling [0.0]
A sample-efficient approach to solving multiobjective optimization is via process oracle (GP) surrogates and MOBOOTS$.
We propose a Thompson sampling (TS) based approach ($qtextttPOTS$)
$qtextttPOTS$ solves a cheap multiobjective optimization on the GP posteriors with evolutionary approaches.
arXiv Detail & Related papers (2023-10-24T12:35:15Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $tildeO(sqrtd2 H3 T )$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.