Related papers: Doubly Robust Alignment for Large Language Models

Doubly Robust Alignment for Large Language Models

URL: http://arxiv.org/abs/2506.01183v1
Date: Sun, 01 Jun 2025 21:34:37 GMT
Title: Doubly Robust Alignment for Large Language Models
Authors: Erhan Xu, Kai Ye, Hongyi Zhou, Luhan Zhu, Francesco Quinzan, Chengchun Shi,
Abstract summary: This paper studies reinforcement learning from human feedback for aligning large language models with human preferences.<n>We propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified.
Score: 10.092889408835656
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM

Related papers

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF [14.663977441172115]
We introduce a Flipping-Aware Direct Preference Optimization (FA-DPO) algorithm tailored to preference flipping from a reinforcement learning perspective.<n>By leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns.<n>In our experiments, we investigate the instance-dependent preference flipping model under multiple circumstances.
arXiv Detail & Related papers (2025-11-30T03:16:20Z)
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values.<n>Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences.<n>We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2025-10-17T15:00:40Z)
RankPO: Preference Optimization for Job-Talent Matching [7.385902340910447]
We propose a two-stage training framework for large language models (LLMs)<n>In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules.<n>In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO) to align the model with AI-curated pairwise preferences.
arXiv Detail & Related papers (2025-03-13T10:14:37Z)
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment [59.536850459059856]
We introduce MM-RLHF, a dataset containing $mathbf120k$ fine-grained, human-annotated preference comparison pairs.<n>We propose several key innovations to improve the quality of reward models and the efficiency of alignment algorithms.<n>Our approach is rigorously evaluated across $mathbf10$ distinct dimensions and $mathbf27$ benchmarks.
arXiv Detail & Related papers (2025-02-14T18:59:51Z)
Clone-Robust AI Alignment [20.38824614301761]
Reinforcement Learning with Human Feedback (RLHF) uses pairwise comparisons from human annotators to train reward functions.<n>We introduce robustness to approximate clones, a desirable property of RLHF algorithms.<n>We propose the weighted MLE, a new RLHF algorithm that modifies the standard regularized maximum likelihood estimation.
arXiv Detail & Related papers (2025-01-16T02:43:44Z)
Ordinal Preference Optimization: Aligning Human Preferences via NDCG [28.745322441961438]
We develop an end-to-end preference optimization algorithm by approxing NDCG with a differentiable surrogate loss. OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval.
arXiv Detail & Related papers (2024-10-06T03:49:28Z)
Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently.<n>We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.<n>Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. We propose a robust RLHF approach -- $R3M$, which models the potentially corrupted preference label as sparse outliers. Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL) It streamlines the integration of preferences into an arbitrary class of pre-trained models. We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z)
Preference Learning Algorithms Do Not Learn Preference Rankings [62.335733662381884]
We study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs. We find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets.
arXiv Detail & Related papers (2024-05-29T21:29:44Z)
Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values.<n>Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences.<n>We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2024-05-23T21:25:20Z)
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization [105.3612692153615]
We propose a new axis based on eliciting preferences jointly over instruction-response pairs.<n>Joint preferences over instruction and response pairs can significantly enhance the alignment of large language models.
arXiv Detail & Related papers (2024-03-31T02:05:40Z)
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization [76.09576643028362]
We present Multi-Objective Direct Preference Optimization (MODPO) for multiple alignment objectives. MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models. It theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.
arXiv Detail & Related papers (2023-10-05T17:35:26Z)
The Wisdom of Hindsight Makes Language Models Better Instruction Followers [84.9120606803906]
Reinforcement learning has seen wide success in finetuning large language models to better align with instructions via human feedback. In this paper, we consider an alternative approach: converting feedback to instruction by relabeling the original one and training the model for better alignment in a supervised manner. We propose Hindsight Instruction Relabeling (HIR), a novel algorithm for aligning language models with instructions.
arXiv Detail & Related papers (2023-02-10T12:16:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.