Distributional Preference Alignment of LLMs via Optimal Transport
- URL: http://arxiv.org/abs/2406.05882v1
- Date: Sun, 9 Jun 2024 18:41:05 GMT
- Title: Distributional Preference Alignment of LLMs via Optimal Transport
- Authors: Igor Melnyk, Youssef Mroueh, Brian Belgodere, Mattia Rigotti, Apoorva Nitsure, Mikhail Yurochkin, Kristjan Greenewald, Jiri Navratil, Jerret Ross
- Abstract summary: We propose a novel method for distributional preference alignment of LLMs called Alignment via Optimal Transport (AOT).
AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples first-order stochastically dominant over the distribution of negative samples.
We show that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
- Score: 36.95053112313244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current LLM alignment techniques use pairwise human preferences at a sample level, and as such, they do not imply an alignment on the distributional level. We propose in this paper Alignment via Optimal Transport (AOT), a novel method for distributional preference alignment of LLMs. AOT aligns LLMs on unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order over the distribution of negative samples. We introduce a convex relaxation of this first-order stochastic dominance and cast it as an optimal transport problem with a smooth and convex cost. Thanks to the one-dimensional nature of the resulting optimal transport problem and the convexity of the cost, it has a closed-form solution via sorting on empirical measures. We fine-tune LLMs with this AOT objective, which enables alignment by penalizing the violation of the stochastic dominance of the reward distribution of the positive samples over the reward distribution of the negative samples. We analyze the sample complexity of AOT by considering the dual of the OT problem and show that it converges at the parametric rate. Empirically, we show on a diverse set of alignment datasets and LLMs that AOT leads to state-of-the-art models in the 7B family of models when evaluated with Open LLM Benchmarks and AlpacaEval.
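The sorting-based, one-dimensional structure of the objective is easy to illustrate. Below is a minimal sketch, not the authors' released implementation, of what a batch-level dominance penalty could look like; equal-sized batches of scalar rewards and a squared-hinge surrogate for the smooth convex cost are assumptions.

```python
import torch

def aot_dominance_penalty(pos_rewards: torch.Tensor,
                          neg_rewards: torch.Tensor,
                          margin: float = 0.0) -> torch.Tensor:
    """Sketch of a first-order-dominance penalty on unpaired rewards.

    On empirical measures, the 1-D OT problem with a convex cost reduces
    to sorting: after sorting both batches, each quantile of the positive
    rewards should exceed the matching quantile of the negative rewards.
    Equal batch sizes are assumed for simplicity.
    """
    pos_sorted, _ = torch.sort(pos_rewards)
    neg_sorted, _ = torch.sort(neg_rewards)
    # Quantile-wise violation of stochastic dominance; the squared hinge
    # is one choice of smooth convex surrogate (an assumption, not
    # necessarily the exact cost used in the paper).
    violation = torch.relu(neg_sorted + margin - pos_sorted)
    return violation.pow(2).mean()
```

In a DPO-style setup, `pos_rewards` and `neg_rewards` could be implicit rewards such as `beta * (policy_logp - ref_logp)` computed on chosen and rejected responses; since the two batches are sorted independently, the data need not be paired.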
Related papers
- BELM: Bidirectional Explicit Linear Multi-step Sampler for Exact Inversion in Diffusion Models [11.063964007950249]
We introduce a generic formulation, the Bidirectional Explicit Linear Multi-step (BELM) samplers.
The BELM formulation is derived from the variable-stepsize-formula linear multi-step method.
We show that the existing designs of exact inversion samplers yield sub-optimal solutions to this minimization.
arXiv Detail & Related papers (2024-10-09T06:32:26Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
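As a rough illustration of such a combined objective, the sketch below adds an SFT term on the chosen responses to a DPO-style preference loss; the weighting and the choice of DPO as the preference loss are assumptions for illustration, not the paper's exact algorithm.

```python
import torch.nn.functional as F

def regularized_preference_loss(policy_logp_chosen, policy_logp_rejected,
                                ref_logp_chosen, ref_logp_rejected,
                                beta: float = 0.1, sft_weight: float = 0.5):
    # Implicit rewards relative to the reference policy (DPO-style).
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    pref_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
    # SFT term: keep the policy close to the preferred responses; this is
    # the piece that acts as the implicit regularizer against
    # overoptimizing the preference signal.
    sft_loss = -policy_logp_chosen.mean()
    return pref_loss + sft_weight * sft_loss
```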
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game [31.66896160733569]
We propose an Adversarial Preference Optimization (APO) framework to target more efficient human preference optimization.
We find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness.
arXiv Detail & Related papers (2023-11-14T10:10:31Z)
- Learning to Re-weight Examples with Optimal Transport for Imbalanced Classification [74.62203971625173]
Imbalanced data pose challenges for deep learning based classification models.
One of the most widely-used approaches for tackling imbalanced data is re-weighting.
We propose a novel re-weighting method based on optimal transport (OT) from a distributional point of view.
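A minimal sketch of the general idea, assuming the POT library and a small balanced "meta" set as the target distribution; the concrete costs and weight parametrization in the paper differ.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def ot_example_weights(train_feats: np.ndarray,
                       meta_feats: np.ndarray,
                       reg: float = 0.05) -> np.ndarray:
    """Derive per-example weights from an entropic OT plan between an
    imbalanced training batch and a balanced meta set (illustrative)."""
    n, m = len(train_feats), len(meta_feats)
    a = np.full(n, 1.0 / n)               # uniform mass on training examples
    b = np.full(m, 1.0 / m)               # uniform mass on the balanced set
    M = ot.dist(train_feats, meta_feats)  # pairwise squared-Euclidean costs
    plan = ot.sinkhorn(a, b, M, reg)      # entropic OT plan
    w = plan.sum(axis=1)                  # mass each training example ships
    return w * n / w.sum()                # normalize to mean weight 1
```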
arXiv Detail & Related papers (2022-08-05T01:23:54Z)
- Learning Optimal Transport Between two Empirical Distributions with Normalizing Flows [12.91637880428221]
We propose to leverage the flexibility of neural networks to learn an approximate optimal transport map.
We show that a particular instance of invertible neural networks, namely the normalizing flows, can be used to approximate the solution of this OT problem.
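A toy sketch of the idea: train an invertible map to push source samples onto the target while penalizing how far points move. Here a single affine coupling layer stands in for a full normalizing flow, and a sliced Wasserstein term stands in for the paper's distribution-matching objective; both are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible affine coupling layer (a stand-in for a full flow)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        a, b = x[:, :self.half], x[:, self.half:]
        scale, shift = self.net(a).chunk(2, dim=1)
        return torch.cat([a, b * torch.exp(scale) + shift], dim=1)

def sliced_w2(x, y, n_proj: int = 50):
    """Sliced squared-Wasserstein estimate; assumes equal batch sizes."""
    theta = torch.randn(x.size(1), n_proj)
    theta = theta / theta.norm(dim=0, keepdim=True)  # random unit directions
    xp, _ = torch.sort(x @ theta, dim=0)
    yp, _ = torch.sort(y @ theta, dim=0)
    return (xp - yp).pow(2).mean()

def ot_flow_loss(flow, x_src, y_tgt, lam: float = 1.0):
    """Push source samples onto the target while moving each point as
    little as possible, which is what makes the learned map OT-like."""
    z = flow(x_src)
    transport_cost = (z - x_src).pow(2).sum(dim=1).mean()
    return transport_cost + lam * sliced_w2(z, y_tgt)
```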
arXiv Detail & Related papers (2022-07-04T08:08:47Z)
- Rethinking Collaborative Metric Learning: Toward an Efficient Alternative without Negative Sampling [156.7248383178991]
The Collaborative Metric Learning (CML) paradigm has attracted wide interest in the area of recommendation systems (RS).
We find that negative sampling would lead to a biased estimation of the generalization error.
Motivated by this, we propose an efficient alternative for CML without negative sampling, named Sampling-Free Collaborative Metric Learning (SFCML).
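The flavor of a sampling-free objective can be sketched by scoring every user against the full item set instead of drawing negatives; this hinge form is an illustration, not the exact SFCML objective.

```python
import torch

def sampling_free_cml_loss(user_emb: torch.Tensor,
                           item_emb: torch.Tensor,
                           pos_mask: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    """Hinge loss over ALL items per user; pos_mask[u, i] = 1 for observed
    interactions. No negatives are sampled, hence no sampling bias."""
    d2 = torch.cdist(user_emb, item_emb).pow(2)      # (users, items)
    pos_d2 = (d2 * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    neg_mask = 1.0 - pos_mask
    # Every non-interacted item acts as a negative.
    hinge = torch.relu(pos_d2.unsqueeze(1) + margin - d2) * neg_mask
    return hinge.sum() / neg_mask.sum().clamp(min=1)
```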
arXiv Detail & Related papers (2022-06-23T08:50:22Z)
- Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence [77.06203118175335]
Variational Inference (VI) is a popular alternative to exact sampling in Bayesian inference.
Importance sampling (IS) is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures.
We propose a novel combination of optimization and sampling techniques for approximate Bayesian inference.
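One way to make this concrete: fit the proposal by descending a self-normalized importance-sampling estimate of the forward KL gradient. The sketch assumes an unnormalized log-target `log_p` and a parametric `torch.distributions` proposal; the paper's actual procedure differs in its details.

```python
import torch

def forward_kl_step(log_p, proposal, n_samples: int = 1024) -> torch.Tensor:
    """One surrogate-loss evaluation for minimizing KL(p || q).

    Samples come from the current proposal q; self-normalized importance
    weights re-target the expectation to p. Gradients flow only through
    log q, which is all the forward-KL gradient needs.
    """
    x = proposal.sample((n_samples,))        # no gradient through samples
    log_q = proposal.log_prob(x)
    log_w = log_p(x) - log_q.detach()        # unnormalized IS log-weights
    w = torch.softmax(log_w, dim=0)          # self-normalization
    return -(w * log_q).sum()                # descend to shrink KL(p || q)
```

Because the forward KL is mass-covering, the refined q tends to cover the tails of p, which is precisely what makes it a safer importance-sampling proposal than a reverse-KL variational fit.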
arXiv Detail & Related papers (2021-06-30T11:00:24Z)
- Midpoint Regularization: from High Uncertainty Training to Conservative Classification [19.252319300590653]
Label Smoothing (LS) improves model generalization by penalizing models for generating overconfident output distributions.
We extend this technique to example pairs, a method coined PLS. PLS first creates midpoint samples by averaging random sample pairs, then learns a smoothing distribution for each midpoint during training, yielding midpoints with high-uncertainty labels.
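A rough sketch of the midpoint construction follows, with a fixed high-uncertainty target standing in for the smoothing distribution that the paper learns; the `uncertainty` hyperparameter and the target's form are assumptions.

```python
import torch
import torch.nn.functional as F

def pls_midpoints(x: torch.Tensor, y: torch.Tensor,
                  num_classes: int, uncertainty: float = 0.4):
    """Create midpoint samples by averaging random pairs and assign them
    softened, high-uncertainty labels."""
    perm = torch.randperm(x.size(0))
    x_mid = 0.5 * (x + x[perm])                     # pairwise midpoints
    y1 = F.one_hot(y, num_classes).float()
    y2 = F.one_hot(y[perm], num_classes).float()
    # Split the confident mass across the two source classes and spread
    # the remainder uniformly; the paper instead learns this smoothing
    # distribution during training.
    y_mid = (1.0 - uncertainty) * 0.5 * (y1 + y2) + uncertainty / num_classes
    return x_mid, y_mid
```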
arXiv Detail & Related papers (2021-06-26T00:31:46Z)
- A Distributional Approach to Controlled Text Generation [3.279201607581627]
We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs).
This view permits us to define, in a single formal framework, "pointwise" and "distributional" constraints over the target LM.
We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models.
arXiv Detail & Related papers (2020-12-21T19:02:41Z)
- Learning to Match Distributions for Domain Adaptation [116.14838935146004]
This paper proposes Learning to Match (L2M) to automatically learn the cross-domain distribution matching.
L2M reduces the inductive bias by using a meta-network to learn the distribution matching loss in a data-driven way.
Experiments on public datasets substantiate the superiority of L2M over SOTA methods.
arXiv Detail & Related papers (2020-07-17T03:26:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.