Related papers: REFA: Reference Free Alignment for multi-preference optimization

Related papers

Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.<n>We show that the widely used beam search method suffers from unacceptable over-optimism.<n>We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
Length-Controlled Margin-Based Preference Optimization without Reference Model [11.878496378814045]
We propose Length-Controlled Margin-Based Preference Optimization (LMPO) for preference-based reinforcement learning. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches.
arXiv Detail & Related papers (2025-02-20T15:30:27Z)
Smoothed Normalization for Efficient Distributed Private Optimization [54.197255548244705]
Federated learning enables machine learning models with privacy of participants. There is no differentially private distributed method for training, non-feedback problems. We introduce a new distributed algorithm $alpha$-$sf NormEC$ with provable convergence guarantees.
arXiv Detail & Related papers (2025-02-19T07:10:32Z)
Simplify RLHF as Reward-Weighted SFT: A Variational Method [34.222095430239555]
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. We propose a novel simplification of RLHF from the perspective of variational inference. We transform the alignment objective into a reward-driven supervised fine-tuning form to obtain noticeable improvement on training stability and effectiveness.
arXiv Detail & Related papers (2025-02-16T07:22:00Z)
REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models [8.587685197004097]
REINFORCE++ is a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. It exhibits robust performance across various reward models without requiring prompt set truncation. It achieves superior generalization in both RLHF and long chain-of-thought settings compared to existing REINFORCE-based methods.
arXiv Detail & Related papers (2025-01-04T02:08:06Z)
SWEPO: Simultaneous Weighted Preference Optimization for Group Contrastive Alignment [16.230186347702737]
We propose Simultaneous Weighted Preference Optimization (SWEPO) SWEPO incorporates multiple responses per query and prioritizes those that deviate most from the average reward. We prove that such multi-preference sampling lowers alignment bias, bounding the expected deviation from the true acceptable-response distribution at a rate of $mathcalO(tfrac1sqrtk)$.
arXiv Detail & Related papers (2024-12-05T21:50:22Z)
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
Geometric-Averaged Preference Optimization for Soft Preference Labels [78.2746007085333]
Many algorithms for aligning LLMs with human preferences assume that human preferences are binary and deterministic. In this work, we introduce the distributional soft preference labels and improve Direct Preference Optimization (DPO) with a weighted geometric average of the LLM output likelihood in the loss function.
arXiv Detail & Related papers (2024-09-10T17:54:28Z)
Reward Difference Optimization For Sample Reweighting In Offline RLHF [18.62836654699957]
Current offline RLHF only captures the "ordinal relationship" between responses, overlooking the crucial aspect of how much one is preferred over the others. We propose a simple yet effective solution called Reward Difference Optimization, shorted as RDO. Experiments with 7B LLMs on the HH and TL;DR datasets substantiate the effectiveness of our method in both automatic metrics and human evaluation.
arXiv Detail & Related papers (2024-08-18T07:04:16Z)
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$chi2$-Preference Optimization ($chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. We propose a robust RLHF approach -- $R3M$, which models the potentially corrupted preference label as sparse outliers. Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback [56.6950165117658]
We consider offline reinforcement learning with preference feedback in which the implicit reward is a linear function of an unknown parameter. We propose an algorithm, underlineRL with underlineLocally underlineOptimal underlineWeights or sc RL-LOW, which yields a simple regret of $exp. We observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating order-wise optimality of sc RL-LOW.
arXiv Detail & Related papers (2024-06-18T02:03:12Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.<n>To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.<n>Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
$i$REPO: $i$mplicit Reward Pairwise Difference based Empirical Preference Optimization [12.266207199002604]
Large Language Models (LLM) can sometimes produce outputs that deviate from human expectations. We propose a novel framework named $i$REPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization. We show that $i$REPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators.
arXiv Detail & Related papers (2024-05-24T05:42:11Z)
Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences. We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step. Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
Preference Ranking Optimization for Human Alignment [90.6952059194946]
Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. We propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to fine-tune LLMs for human alignment.
arXiv Detail & Related papers (2023-06-30T09:07:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.