Related papers: Averaging log-likelihoods in direct alignment

Averaging log-likelihoods in direct alignment

URL: http://arxiv.org/abs/2406.19188v1
Date: Thu, 27 Jun 2024 14:07:38 GMT
Title: Averaging log-likelihoods in direct alignment
Authors: Nathan Grinsztajn, Yannis Flet-Berliac, Mohammad Gheshlaghi Azar, Florian Strub, Bill Wu, Eugene Choi, Chris Cremer, Arash Ahmadian, Yash Chandak, Olivier Pietquin, Matthieu Geist,
Abstract summary: We introduce a new averaging operator to be composed with the optimality operator giving the best policy for the underlying RL problem. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
Score: 43.77763433288893
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions according to the trained model. However, completions have various lengths, and the log-likelihood is not length-invariant. On the other side, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled approach for making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator giving the best policy for the underlying RL problem. It translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.

Related papers

Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback [12.158181906895186]
Reinforcement learning with human feedback has emerged as a central paradigm for aligning large language models with human preferences.<n>We investigate exploration principles for online RLHF, where one seeks to refine both the reward model and the policy in a data-efficient manner.<n>Motivated by this insight, we propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences.
arXiv Detail & Related papers (2025-09-26T17:57:17Z)
PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training [9.093854840532062]
PITA is a novel framework that integrates preference feedback directly into the LLM's token generation.<n> PITA learns a small preference-based guidance policy to modify token probabilities at inference time without fine-tuning.<n>We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification.
arXiv Detail & Related papers (2025-07-26T21:46:32Z)
Bias Fitting to Mitigate Length Bias of Reward Model in RLHF [81.44256822500257]
Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences.<n>We propose FiMi-RM, a framework that autonomously learns and corrects underlying bias patterns.<n> Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution.
arXiv Detail & Related papers (2025-05-19T08:29:28Z)
Best Policy Learning from Trajectory Preference Feedback [15.799929216215672]
We address the problem of best policy identification in preference-based reinforcement learning (PbRL) We propose Posterior Sampling for Preference Learning ($mathsfPSPL$), a novel algorithm inspired by Top-Two Thompson Sampling. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret.
arXiv Detail & Related papers (2025-01-31T03:55:10Z)
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference [17.76565371753346]
This paper develops two RLHF algorithms without reward inference. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.
arXiv Detail & Related papers (2024-09-25T22:20:11Z)
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $mathsfBSAD$, without explicit reward model inference. The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms [50.808123629394245]
Direct Alignment Algorithms (DDAs) like Direct Preference Optimization have emerged as alternatives to the classical RLHF pipeline. This work formulates and formalizes the reward over-optimization or hacking problem for DAAs and explores its consequences across objectives, training regimes, and model scales.
arXiv Detail & Related papers (2024-06-05T03:41:37Z)
Aligning Large Language Models via Fine-grained Supervision [20.35000061196631]
Pre-trained large-scale language models (LLMs) excel at producing coherent articles, yet their outputs may be untruthful, toxic, or fail to align with user expectations. Current approaches focus on using reinforcement learning with human feedback to improve model alignment. We propose a method to enhance LLM alignment through fine-grained token-level supervision.
arXiv Detail & Related papers (2024-06-04T20:21:45Z)
Preference Alignment with Flow Matching [23.042382086241364]
Preference Flow Matching (PFM) is a new framework for preference-based reinforcement learning (PbRL) It streamlines the integration of preferences into an arbitrary class of pre-trained models. We provide theoretical insights that support our method's alignment with standard PbRL objectives.
arXiv Detail & Related papers (2024-05-30T08:16:22Z)
Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem. PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins. Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models. RLHF is know to exploit biases in human preferences, such as verbosity. We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z)
Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits approach (LORT) without the requirement of prior knowledge of the number of samples per class. Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642]
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions. CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
arXiv Detail & Related papers (2023-10-20T16:37:56Z)
Non-ergodicity in reinforcement learning: robustness via ergodicity transformations [8.44491527275706]
Application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance. We argue that a fundamental issue contributing to this lack of robustness lies in the focus on the expected value of the return. We propose an algorithm for learning ergodicity from data and demonstrate its effectiveness in an instructive, non-ergodic environment.
arXiv Detail & Related papers (2023-10-17T15:13:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.