All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
- URL: http://arxiv.org/abs/2503.01067v2
- Date: Fri, 17 Oct 2025 14:56:28 GMT
- Title: All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
- Authors: Gokul Swamy, Sanjiban Choudhury, Wen Sun, Zhiwei Steven Wu, J. Andrew Bagnell
- Abstract summary: We show that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback. We find the most support for the explanation that on problems with a generation-verification gap, it is relatively easy to learn the relatively simple RM from the preference data.
- Score: 49.43901716932925
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g., human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on said dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, (1) it is relatively easy to learn the relatively simple RM (verifier) from the preference data. Then, (2) the downstream RL procedure only returns policies (generators) that are optimal for such relatively simple verifiers. Thus, end-to-end, two-stage online FT only has to search over a reduced subset of the full space of policies, requiring less data than offline FT.
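To make the contrast concrete, below is a minimal, self-contained toy sketch in Python (PyTorch). The five-response "policy", the preference pairs, and all hyperparameters are invented for illustration; this is not the paper's experimental setup, only the shape of the two routes: offline MLE on the preferred data versus fitting a Bradley-Terry RM and then running on-policy RL against it.
```python
import torch

torch.manual_seed(0)
K = 5                                              # toy space of 5 candidate responses
prefs = [(4, 0), (3, 1), (4, 2), (2, 0), (3, 0)]   # (preferred, rejected) pairs

# Route A: offline maximum likelihood, i.e. imitate the preferred responses.
mle_logits = torch.zeros(K, requires_grad=True)
opt = torch.optim.Adam([mle_logits], lr=0.1)
for _ in range(200):
    loss = -torch.log_softmax(mle_logits, 0)[[i for i, _ in prefs]].mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Route B, stage 1: fit a simple Bradley-Terry reward model (the verifier).
rm = torch.zeros(K, requires_grad=True)
opt = torch.optim.Adam([rm], lr=0.1)
for _ in range(200):
    loss = -torch.stack([torch.log(torch.sigmoid(rm[i] - rm[j]))
                         for i, j in prefs]).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Route B, stage 2: on-policy RL (REINFORCE) against the learned verifier.
rl_logits = torch.zeros(K, requires_grad=True)
opt = torch.optim.Adam([rl_logits], lr=0.1)
for _ in range(200):
    dist = torch.distributions.Categorical(logits=rl_logits)
    a = dist.sample((32,))                          # on-policy sampling
    loss = -(rm.detach()[a] * dist.log_prob(a)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(mle_logits, 0))   # policy from one-stage offline MLE
print(torch.softmax(rl_logits, 0))    # policy from two-stage RM + RL
```
Even at this scale, the two-stage route's search is shaped entirely by the learned verifier `rm`, which is the reduced policy search space the abstract alludes to.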
Related papers
- Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training [61.1421888242439]
Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). We propose a framework to bridge this chasm by enabling On-Policy SFT.
arXiv Detail & Related papers (2026-02-12T17:59:58Z)
- Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting [40.80967570661867]
Adapting language models to new tasks via post-training carries the risk of degrading existing capabilities. We compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). RL leads to less forgetting than SFT while achieving comparable or higher target task performance.
arXiv Detail & Related papers (2025-10-21T17:59:41Z)
- One-Token Rollout: Guiding Supervised Fine-Tuning of LLMs with Policy Gradient [16.05489579792086]
We introduce one-token rollout (OTR), a novel fine-tuning algorithm that guides SFT with the policy gradient method. OTR reframes the autoregressive learning process by treating each token generation as a single-step reinforcement learning trajectory. Our findings establish OTR as a powerful and practical alternative for fine-tuning LLMs.
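One plausible, heavily simplified reading of the single-step-trajectory idea is sketched below; the reward rule (+1 when the sampled token matches the supervised token) is our guess for illustration, not OTR's actual specification.
```python
import torch

torch.manual_seed(0)
vocab, seq = 10, 6
logits = torch.randn(seq, vocab, requires_grad=True)   # stand-in for a model
target = torch.randint(vocab, (seq,))                  # supervised tokens
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    sampled = dist.sample()                  # one-token "rollout" per position
    reward = (sampled == target).float()     # single-step, per-token reward
    loss = -(reward * dist.log_prob(sampled)).mean()   # policy gradient
    opt.zero_grad(); loss.backward(); opt.step()
```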
arXiv Detail & Related papers (2025-09-30T14:25:56Z)
- Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback [12.158181906895186]
Reinforcement learning with human feedback has emerged as a central paradigm for aligning large language models with human preferences. We investigate exploration principles for online RLHF, where one seeks to refine both the reward model and the policy in a data-efficient manner. We propose a new exploration scheme that directs preference queries toward reducing uncertainty in reward differences.
arXiv Detail & Related papers (2025-09-26T17:57:17Z)
- wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models [15.638885149395657]
The intractability of the dLLM likelihood function requires approximating the current, old, and reference policy likelihoods at each policy optimization step. We introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs.
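A hedged sketch of the general weighted-likelihood idea (not wd1's exact objective; the softmax-of-rewards weights are our illustrative choice):
```python
import torch

torch.manual_seed(0)
K = 8
logits = torch.zeros(K, requires_grad=True)   # toy policy over K completions
rewards = torch.randn(K)                      # e.g., verifier scores (made up)
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    w = torch.softmax(rewards, 0)                       # fixed per-sample weights
    loss = -(w * torch.log_softmax(logits, 0)).sum()    # weighted likelihood
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(logits, 0))   # policy tilts toward high-reward completions
```
Replacing per-sample importance ratios with fixed weights on the log-likelihood is what sidesteps the intractable likelihood-ratio computation the snippet mentions.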
arXiv Detail & Related papers (2025-07-07T21:27:25Z)
- Best Policy Learning from Trajectory Preference Feedback [15.799929216215672]
We address the problem of best policy identification in preference-based reinforcement learning (PbRL). We propose Posterior Sampling for Preference Learning ($\mathsf{PSPL}$), a novel algorithm inspired by Top-Two Thompson Sampling. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret.
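A toy rendition of the top-two posterior-sampling idea with simulated pairwise feedback; the Beta posteriors and the Bradley-Terry preference simulator are our illustrative choices, not the algorithm as specified in the paper.
```python
import numpy as np

rng = np.random.default_rng(0)
true_quality = np.array([0.1, 0.4, 0.7, 0.9])   # hidden arm qualities (made up)
wins = np.ones(4); losses = np.ones(4)          # Beta(1, 1) posteriors

for _ in range(500):
    theta = rng.beta(wins, losses)
    i = int(np.argmax(theta))                   # leader
    theta2 = rng.beta(wins, losses)
    theta2[i] = -np.inf
    j = int(np.argmax(theta2))                  # challenger (top-two)
    # Simulated Bradley-Terry preference feedback between arms i and j:
    p_i = true_quality[i] / (true_quality[i] + true_quality[j])
    if rng.random() < p_i:
        wins[i] += 1; losses[j] += 1
    else:
        wins[j] += 1; losses[i] += 1

print(int(np.argmax(wins / (wins + losses))))   # best-arm estimate
```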
arXiv Detail & Related papers (2025-01-31T03:55:10Z)
- Online Preference Alignment for Language Models via Count-based Exploration [46.46627519343809]
Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage. Online RLHF is more desirable, as it empowers the LLM to explore outside the support of the initial dataset by iteratively collecting prompt-response pairs.
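A minimal sketch of count-based exploration for choosing which response to query next, assuming a 1/sqrt(count) bonus on top of the reward-model score (an illustrative choice, not necessarily the paper's bonus):
```python
import numpy as np

rng = np.random.default_rng(0)
K, beta = 6, 1.0
rm_score = rng.random(K)                     # learned RM scores for K responses
counts = np.ones(K)                          # how often each has been queried

for _ in range(100):
    ucb = rm_score + beta / np.sqrt(counts)  # optimism on rarely seen responses
    pick = int(np.argmax(ucb))               # choose the next preference query
    counts[pick] += 1                        # (real feedback would update rm_score)

print(counts)   # queries spread beyond the current RM argmax
```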
arXiv Detail & Related papers (2025-01-22T09:12:09Z)
- Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce value-incentivized preference optimization (VPO), a unified approach to online and offline RLHF.
VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function.
Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
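A toy rendering of the stated idea, with the soft value logsumexp(r) standing in for "the corresponding value function" and a pessimistic sign; this is our interpretation for illustration, not the authors' exact objective.
```python
import torch

torch.manual_seed(0)
K, lam = 5, 0.1
prefs = [(4, 0), (3, 1), (4, 2)]               # (preferred, rejected) pairs
r = torch.zeros(K, requires_grad=True)         # reward estimate per response
opt = torch.optim.Adam([r], lr=0.1)

for _ in range(200):
    nll = -torch.stack([torch.log(torch.sigmoid(r[i] - r[j]))
                        for i, j in prefs]).mean()   # Bradley-Terry MLE term
    value = torch.logsumexp(r, 0)              # soft value induced by the reward
    loss = nll + lam * value                   # pessimistic sign for offline data
    opt.zero_grad(); loss.backward(); opt.step()
```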
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
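A hedged tabular sketch of the general idea of LLM guidance as a regularizer in value-based RL; the log-prior bonus form is our illustration, not LINVIT's precise update.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, lam = 4, 3, 0.9, 1.0
R = rng.random((S, A))                          # toy rewards
P = rng.dirichlet(np.ones(S), size=(S, A))     # toy transition kernel
llm_prior = rng.dirichlet(np.ones(A), size=S)  # stand-in for LLM suggestions

Q = np.zeros((S, A))
for _ in range(200):                            # regularized value iteration
    V = np.max(Q + lam * np.log(llm_prior), axis=1)
    Q = R + gamma * P @ V

print(np.argmax(Q + lam * np.log(llm_prior), axis=1))   # guided greedy policy
```
The log-prior term biases the value backup toward actions the LLM considers plausible, which is one way guidance can reduce the data needed for learning.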
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Model-based Offline Imitation Learning with Non-expert Data [7.615595533111191]
We propose a scalable model-based offline imitation learning algorithmic framework that leverages datasets collected by both suboptimal and optimal policies.
We show that the proposed method always outperforms Behavioral Cloning in the low-data regime on simulated continuous control domains.
arXiv Detail & Related papers (2022-06-11T13:08:08Z)
- False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z)
- Interpretable performance analysis towards offline reinforcement learning: A dataset perspective [6.526790418943535]
We propose a two-fold taxonomy for existing offline RL algorithms.
We explore the correlation between the performance of different types of algorithms and the distribution of actions under states.
We create a benchmark platform on the Atari domain, entitled easy go (RLEG), at an estimated cost of more than 0.3 million dollars.
arXiv Detail & Related papers (2021-05-12T07:17:06Z)
- Instabilities of Offline RL with Pre-Trained Neural Representation [127.89397629569808]
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated.
Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold.
This work studies these issues from an empirical perspective to gauge how stable offline RL methods are.
arXiv Detail & Related papers (2021-03-08T18:06:44Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
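A hedged tabular sketch of pessimistic value iteration, using a simple count-based uncertainty quantifier as the penalty; the penalty form here is illustrative, whereas the paper constructs Gamma(s, a) with formal guarantees.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
R_hat = rng.random((S, A))                      # estimated rewards
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # estimated transitions
counts = rng.integers(1, 50, size=(S, A))       # visits in the offline data
Gamma = 1.0 / np.sqrt(counts)                   # uncertainty quantifier

Q = np.zeros((S, A))
for _ in range(200):
    V = np.max(Q, axis=1)
    Q = np.clip(R_hat + gamma * P_hat @ V - Gamma, 0.0, None)  # pessimism

print(np.argmax(Q, axis=1))                     # pessimistic greedy policy
```
Subtracting the penalty steers the policy away from state-action pairs the offline dataset covers poorly.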
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.