Related papers: AMPO: Active Multi-Preference Optimization

AMPO: Active Multi-Preference Optimization

URL: http://arxiv.org/abs/2502.18293v1
Date: Tue, 25 Feb 2025 15:29:51 GMT
Title: AMPO: Active Multi-Preference Optimization
Authors: Taneesh Gupta, Rahul Madhavan, Xuchao Zhang, Chetan Bansal, Saravan Rajmohan,
Abstract summary: Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses.<n>We propose $textitActive Multi-Preference Optimization$ (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection.<n>AMPO achieves state-of-the-art results on $textitAlpacaEval$ using Llama 8B.
Score: 16.230186347702737
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-preference optimization enriches language-model alignment beyond pairwise preferences by contrasting entire sets of helpful and undesired responses, thereby enabling richer training signals for large language models. During self-play alignment, these models often produce numerous candidate answers per query, rendering it computationally infeasible to include all responses in the training objective. In this work, we propose $\textit{Active Multi-Preference Optimization}$ (AMPO), a novel approach that combines on-policy generation, a multi-preference group-contrastive loss, and active subset selection. Specifically, we score and embed large candidate pools of responses and then select a small, yet informative, subset that covers reward extremes and distinct semantic clusters for preference optimization. Our contrastive training scheme is capable of identifying not only the best and worst answers but also subtle, underexplored modes that are crucial for robust alignment. Theoretically, we provide guarantees for expected reward maximization using our active selection method, and empirically, AMPO achieves state-of-the-art results on $\textit{AlpacaEval}$ using Llama 8B.

Related papers

DCRM: A Heuristic to Measure Response Pair Quality in Preference Optimization [13.366219858094297]
We study 3 types of commonly used preference datasets, classified along two axes: the source of the responses and the preference labeling function.<n>We establish a general correlation between higher Distance Calibrated Reward Margin of the training set and better learning outcome.<n>Our method produces training datasets that can further improve models' performance on AlpacaEval, MT-Bench, and Arena-Hard.
arXiv Detail & Related papers (2025-06-17T03:37:41Z)
Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment [74.25832963097658]
Multi-Objective Alignment (MOA) aims to align responses with multiple human preference objectives.<n>We find that DPO-based MOA approaches suffer from widespread preference conflicts in the data.
arXiv Detail & Related papers (2025-02-20T08:27:00Z)
Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts [17.243429150450886]
We propose $textbfMulti-Preference Optimization (MPO) to optimize over entire sets of responses.<n>MPO employs deviation-based weighting, which emphasizes outlier responses that deviate most from the mean reward.<n>We theoretically prove that MPO reduces alignment bias at a rate of $mathcalOleft(frac1sqrtnright)$ with respect to the number of responses per query.
arXiv Detail & Related papers (2024-12-05T21:50:22Z)
$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches.<n>We conduct experiments on state-of-the-art language models using benchmark datasets.
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss. The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z)
Traversing Pareto Optimal Policies: Provably Efficient Multi-Objective Reinforcement Learning [14.260168974085376]
This paper investigates multi-objective reinforcement learning (MORL) It focuses on learning optimal policies in the presence of multiple reward functions. Despite MORL's success, there is still a lack of satisfactory understanding of various MORL optimization targets and efficient learning algorithms.
arXiv Detail & Related papers (2024-07-24T17:58:49Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.<n>Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
qPOTS: Efficient batch multiobjective Bayesian optimization via Pareto optimal Thompson sampling [0.0]
A sample-efficient approach to solving multiobjective optimization is via process oracle (GP) surrogates and MOBOOTS$.<n>We propose a Thompson sampling (TS) based approach ($qtextttPOTS$)<n>$qtextttPOTS$ solves a cheap multiobjective optimization on the GP posteriors with evolutionary approaches.
arXiv Detail & Related papers (2023-10-24T12:35:15Z)
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [76.09576643028362]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives. MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models. It theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)
Resource Aware Multifidelity Active Learning for Efficient Optimization [0.8717253904965373]
This paper introduces the Resource Aware Active Learning (RAAL) strategy to accelerate the optimization of black box functions. The RAAL strategy optimally seeds multiple points at each allowing for a major speed up of the optimization task.
arXiv Detail & Related papers (2020-07-09T10:01:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.