Leverage the Average: an Analysis of KL Regularization in RL
- URL: http://arxiv.org/abs/2003.14089v5
- Date: Wed, 6 Jan 2021 14:12:57 GMT
- Title: Leverage the Average: an Analysis of KL Regularization in RL
- Authors: Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, Matthieu Geist
- Abstract summary: We show that Kullback-Leibler (KL) regularization implicitly averages q-values.
We provide a very strong performance bound, the very first to combine two desirable aspects.
Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
- Score: 44.01222241795292
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler
(KL) regularization as a core component have shown outstanding performance.
Yet, so far, little is understood theoretically about why KL regularization helps.
We study KL regularization within an approximate value iteration scheme
and show that it implicitly averages q-values. Leveraging this insight, we
provide a very strong performance bound, the very first to combine two
desirable aspects: a linear dependency on the horizon (instead of quadratic)
and an error propagation term involving an averaging effect of the estimation
errors (instead of an accumulation effect). We also study the more general case
of an additional entropy regularizer. The resulting abstract scheme encompasses
many existing RL algorithms. Some of our assumptions do not hold with neural
networks, so we complement this theoretical analysis with an extensive
empirical study.
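To make the implicit-averaging claim concrete, below is a minimal tabular sketch (not the paper's implementation or experiments) of KL-regularized approximate value iteration. The KL-regularized greedy step has the closed form pi_{k+1} proportional to pi_k * exp(q_k / lambda); unrolled from a uniform initial policy, the policy becomes a softmax of the sum of all past q-estimates, i.e. a greedy step on their average with temperature lambda/(k+1). The script checks this identity numerically; the MDP size, temperature `lam`, discount, and noise level are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of KL-regularized approximate value iteration on a random
# tabular MDP (illustrative, not the paper's setup).  Greedy step:
#   pi_{k+1} = argmax_pi <pi, q_k> - lam * KL(pi || pi_k)
# whose closed form is pi_{k+1} proportional to pi_k * exp(q_k / lam).
# Unrolled from a uniform pi_0, this makes pi_{k+1} a softmax of the summed
# q-estimates -- the "implicit averaging" effect described in the abstract.

rng = np.random.default_rng(0)
n_s, n_a, gamma, lam, noise = 5, 3, 0.9, 1.0, 0.05   # illustrative choices

P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))     # transition kernel P[s, a, s']
r = rng.uniform(size=(n_s, n_a))                     # rewards r[s, a]

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

pi = np.full((n_s, n_a), 1.0 / n_a)                  # pi_0: uniform policy
q = np.zeros((n_s, n_a))                             # q_0
q_sum = np.zeros((n_s, n_a))                         # running sum of q_0 .. q_k

for k in range(50):
    pi = softmax(np.log(pi) + q / lam)               # KL-regularized greedy step
    q_sum += q
    v = (pi * q).sum(axis=1)                         # <pi_{k+1}, q_k>
    # approximate evaluation step with an additive estimation error
    q = r + gamma * P @ v + noise * rng.standard_normal((n_s, n_a))

# The maintained policy equals a softmax of the summed q-estimates, i.e. a
# greedy step on their average with temperature lam / (k + 1).
print("implicit averaging holds:", np.allclose(pi, softmax(q_sum / lam)))
```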
Related papers
- Near-Optimal Sample Complexity in Reward-Free Kernel-Based Reinforcement Learning [17.508280208015943]
We ask how many samples are required to design a near-optimal policy in kernel-based RL.
Existing work addresses this question under restrictive assumptions about the class of kernel functions.
We tackle this fundamental problem using a broad class of kernels and a simpler algorithm compared to prior work.
arXiv Detail & Related papers (2025-02-11T17:15:55Z)
- Logarithmic Regret for Online KL-Regularized Reinforcement Learning [51.113248212150964]
KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models.
Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored.
We propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret.
arXiv Detail & Related papers (2025-02-11T11:11:05Z)
- Sharp Analysis for KL-Regularized Contextual Bandits and RLHF [52.519416266840814]
Reverse-Kullback-Leibler (KL) regularization has emerged as a predominant technique for enhancing policy optimization in reinforcement learning.
We show that a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient.
Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.
arXiv Detail & Related papers (2024-11-07T11:22:46Z)
- A Statistical Theory of Regularization-Based Continual Learning [10.899175512941053]
We provide a statistical analysis of regularization-based continual learning on a sequence of linear regression tasks.
We first derive the convergence rate for the oracle estimator obtained as if all data were available simultaneously.
A byproduct of our theoretical analysis is the equivalence between early stopping and generalized $\ell_2$-regularization.
arXiv Detail & Related papers (2024-06-10T12:25:13Z)
- Sparsest Univariate Learning Models Under Lipschitz Constraint [31.28451181040038]
We propose continuous-domain formulations for one-dimensional regression problems.
We control the Lipschitz constant explicitly using a user-defined upper bound.
We show that both problems admit global minimizers that are continuous and piecewise-linear.
arXiv Detail & Related papers (2021-12-27T07:03:43Z)
- Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP).
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be treated as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
- Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best fitting ReLU with respect to square loss, given access to draws from some unknown distribution (a toy fitting sketch follows this list).
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
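For the last related paper above, the sketch below only illustrates the ReLU-regression objective it refers to (square loss on y ~ max(0, <w, x>)), minimized here with plain subgradient descent on synthetic data; the data model, the small random initialization, and the optimizer are assumptions for demonstration, not that paper's approximation scheme.

```python
import numpy as np

# Toy illustration of the ReLU-regression objective: fit y ~ max(0, <w, x>)
# under square loss with plain subgradient descent on synthetic data.
# Data model, step size, and optimizer are illustrative assumptions.

rng = np.random.default_rng(1)
n, d, lr = 2000, 5, 0.1

w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.maximum(0.0, X @ w_true) + 0.01 * rng.standard_normal(n)

w = 0.1 * rng.standard_normal(d)          # small random init avoids a dead ReLU at w = 0
for _ in range(500):
    pred = np.maximum(0.0, X @ w)
    active = (X @ w > 0).astype(float)    # subgradient indicator at the kink
    grad = (2.0 / n) * (X.T @ ((pred - y) * active))
    w -= lr * grad

print("mean square loss:", np.mean((np.maximum(0.0, X @ w) - y) ** 2))
```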