Related papers: ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

URL: http://arxiv.org/abs/2505.19241v1
Date: Sun, 25 May 2025 17:42:52 GMT
Title: ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
Authors: Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low
Abstract summary: Using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions. We propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions.
Score: 94.36403843133616
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.

Related papers

InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs). At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning. Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap [13.89078939095465]
We introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks.
arXiv Detail & Related papers (2025-08-06T07:24:14Z)
Adaptive Sample Scheduling for Direct Preference Optimization [37.75208455935495]
We introduce a novel problem: Sample Scheduling for DPO. It aims to dynamically and adaptively schedule training samples based on the model's evolving states. We propose SamS, an efficient and effective algorithm that adaptively selects samples in each training batch.
arXiv Detail & Related papers (2025-06-08T10:26:09Z)
Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z)
Efficient Alignment of Large Language Models via Data Sampling [0.4915744683251149]
We propose an information theory-based methodology for efficient alignment by identifying a small, high-quality subset. We find that the model aligned using our proposed methodology outperforms other sampling methods and performs comparably to the model aligned with the full dataset.
arXiv Detail & Related papers (2024-11-15T19:36:15Z)
Optimizing LLMs with Direct Preferences: A Data Efficiency Perspective [4.548047308860141]
This study investigates the impact of different types of preference data on model performance. It aims to reduce models' dependency on extensive amounts of preference data, which is expensive to collect.
arXiv Detail & Related papers (2024-10-22T00:11:41Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets. ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data. Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z)
Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data. Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.