RL with KL penalties is better viewed as Bayesian inference
- URL: http://arxiv.org/abs/2205.11275v1
- Date: Mon, 23 May 2022 12:47:13 GMT
- Title: RL with KL penalties is better viewed as Bayesian inference
- Authors: Tomasz Korbak and Ethan Perez and Christopher L Buckley
- Abstract summary: We analyze challenges associated with treating a language model as a reinforcement learning (RL) policy.
We show how avoiding those challenges requires moving beyond the RL paradigm.
- Score: 4.473139775790299
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is frequently employed in fine-tuning large
language models (LMs), such as GPT-3, to penalize them for undesirable features
of generated sequences, such as offensiveness, social bias, harmfulness or
falsehood. The RL formulation involves treating the LM as a policy and updating
it to maximise the expected value of a reward function which captures human
preferences, such as non-offensiveness. In this paper, we analyze challenges
associated with treating a language model as an RL policy and show how avoiding
those challenges requires moving beyond the RL paradigm. We start by observing
that the standard RL approach is flawed as an objective for fine-tuning LMs
because it leads to distribution collapse: turning the LM into a degenerate
distribution. Then, we analyze KL-regularised RL, a widely used recipe for
fine-tuning LMs, which additionally constrains the fine-tuned LM to stay close
to its original distribution in terms of Kullback-Leibler (KL) divergence. We
show that KL-regularised RL is equivalent to variational inference:
approximating a Bayesian posterior which specifies how to update a prior LM to
conform with evidence provided by the reward function. We argue that this
Bayesian inference view of KL-regularised RL is more insightful than the
typically employed RL perspective. The Bayesian inference view explains how
KL-regularised RL avoids the distribution collapse problem and offers a
first-principles derivation for its objective. While this objective happens to
be equivalent to RL (with a particular choice of parametric reward), there
exist other objectives for fine-tuning LMs which are no longer equivalent to
RL. That observation leads to a more general point: RL is not an adequate
formal framework for problems such as fine-tuning language models. These
problems are best viewed as Bayesian inference: approximating a pre-defined
target distribution.
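For concreteness, the objectives discussed in the abstract can be written out directly. The following is a sketch using notation assumed here (pi_0 for the original LM, pi_theta for the fine-tuned LM, r for the reward function, beta for the KL coefficient), not notation taken verbatim from the paper:

```latex
% Notation assumed: \pi_0 = original LM (prior), \pi_\theta = fine-tuned LM,
% r = reward function, \beta > 0 = KL coefficient, Z = normalising constant.
\begin{aligned}
J_{\mathrm{RL}}(\theta)
  &= \mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right] \\
J_{\mathrm{KL\text{-}RL}}(\theta)
  &= \mathbb{E}_{x \sim \pi_\theta}\!\left[ r(x) \right]
     - \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_0 \right) \\
\pi^{*}(x)
  &= \tfrac{1}{Z}\, \pi_0(x)\, \exp\!\left( r(x)/\beta \right),
  \qquad Z = \textstyle\sum_x \pi_0(x)\, \exp\!\left( r(x)/\beta \right) \\
J_{\mathrm{KL\text{-}RL}}(\theta)
  &= -\beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi^{*} \right) + \beta \log Z
\end{aligned}
```

Since beta*log Z does not depend on theta, maximising the KL-regularised objective is the same as minimising KL(pi_theta || pi*), i.e. variational inference targeting the posterior pi* obtained by reweighting the prior pi_0 with exp(r/beta). This is why the fine-tuned LM is pulled toward that reweighted distribution rather than collapsing onto a single high-reward sequence, whereas the plain RL objective (beta = 0) has no such anchor.

In practice this objective is commonly optimised by shaping each sampled sequence's reward with a log-ratio penalty. A minimal sketch, assuming sequence-level log-probabilities are available; the function name and signature are illustrative, not from the paper:

```python
import torch

def kl_shaped_reward(reward: torch.Tensor,
                     logp_theta: torch.Tensor,
                     logp_prior: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Shaped reward for KL-regularised RL fine-tuning (illustrative sketch).

    reward      -- preference-model score r(x) for each sampled sequence
    logp_theta  -- log pi_theta(x) under the fine-tuned LM
    logp_prior  -- log pi_0(x) under the original (frozen) LM
    beta        -- KL coefficient trading reward against staying near pi_0
    """
    # Monte Carlo estimate of r(x) - beta * log(pi_theta(x) / pi_0(x)),
    # whose expectation under pi_theta is the KL-regularised objective.
    return reward - beta * (logp_theta - logp_prior)
```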
Related papers
- On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization [33.331389392270665]
Preference matching (PM) RLHF is a novel approach that aligns large language models with the preference distribution of the reward model under the Bradley-Terry-Luce/Plackett-Luce model.
Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses.
For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation.
arXiv Detail & Related papers (2024-05-26T07:00:05Z) - More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z) - Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z) - LCRL: Certified Policy Synthesis via Logically-Constrained Reinforcement Learning [78.2286146954051]
LCRL implements model-free Reinforcement Learning (RL) algorithms over unknown Markov Decision Processes (MDPs).
We present case studies to demonstrate the applicability, ease of use, scalability, and performance of LCRL.
arXiv Detail & Related papers (2022-09-21T13:21:00Z) - BADDr: Bayes-Adaptive Deep Dropout RL for POMDPs [22.78390558602203]
We present a representation-agnostic formulation of BRL under partial observability, unifying the previous models under one theoretical umbrella.
We also propose a novel derivation, Bayes-Adaptive Deep Dropout RL (BADDr), based on dropout networks.
arXiv Detail & Related papers (2022-02-17T19:48:35Z) - Challenging Common Assumptions in Convex Reinforcement Learning [34.739021482682176]
We show that erroneously optimizing the infinite-trials objective in place of the actual finite-trials one, as is usually done, can lead to a significant approximation error.
We believe shedding light on this issue will lead to better approaches and methodologies for convex RL.
arXiv Detail & Related papers (2022-02-03T10:47:10Z) - Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability [48.62272919754204]
We study generalization in Bayesian RL under the probably approximately correct (PAC) framework.
Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense.
arXiv Detail & Related papers (2021-09-24T07:48:34Z) - Instabilities of Offline RL with Pre-Trained Neural Representation [127.89397629569808]
In offline reinforcement learning (RL), we seek to utilize offline data to evaluate (or learn) policies in scenarios where the data are collected from a distribution that substantially differs from that of the target policy to be evaluated.
Recent theoretical advances have shown that such sample-efficient offline RL is indeed possible provided certain strong representational conditions hold.
This work studies these issues from an empirical perspective to gauge how stable offline RL methods are.
arXiv Detail & Related papers (2021-03-08T18:06:44Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z) - Leverage the Average: an Analysis of KL Regularization in RL [44.01222241795292]
We show that Kullback-Leibler (KL) regularization implicitly averages q-values.
We provide a very strong performance bound, the very first to combine two desirable aspects.
Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
arXiv Detail & Related papers (2020-03-31T10:55:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.