DR3: Value-Based Deep Reinforcement Learning Requires Explicit
Regularization
- URL: http://arxiv.org/abs/2112.04716v1
- Date: Thu, 9 Dec 2021 06:01:01 GMT
- Title: DR3: Value-Based Deep Reinforcement Learning Requires Explicit
Regularization
- Authors: Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George
Tucker, Sergey Levine
- Abstract summary: We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions.
We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer.
- Score: 125.5448293005647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite overparameterization, deep networks trained via supervised learning
are easy to optimize and exhibit excellent generalization. One hypothesis to
explain this is that overparameterized deep networks enjoy the benefits of
implicit regularization induced by stochastic gradient descent, which favors
parsimonious solutions that generalize well on test inputs. It is reasonable to
surmise that deep reinforcement learning (RL) methods could also benefit from
this effect. In this paper, we discuss how the implicit regularization effect
of SGD seen in supervised learning could in fact be harmful in the offline deep
RL setting, leading to poor generalization and degenerate feature
representations. Our theoretical analysis shows that when existing models of
implicit regularization are applied to temporal difference learning, the
resulting derived regularizer favors degenerate solutions with excessive
"aliasing", in stark contrast to the supervised learning case. We back up these
findings empirically, showing that feature representations learned by a deep
network value function trained via bootstrapping can indeed become degenerate,
aliasing the representations for state-action pairs that appear on either side
of the Bellman backup. To address this issue, we derive the form of this
implicit regularizer and, inspired by this derivation, propose a simple and
effective explicit regularizer, called DR3, that counteracts the undesirable
effects of this implicit regularizer. When combined with existing offline RL
methods, DR3 substantially improves performance and stability, alleviating
unlearning in Atari 2600 games, D4RL domains and robotic manipulation from
images.
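As a rough illustration of the paper's proposal, the sketch below adds a DR3-style penalty, the dot product between the learned features of the state-action pair being updated and the features of the state-action pair appearing in its Bellman target, on top of an ordinary TD loss. This is a minimal sketch, not the authors' released implementation: the QNetwork architecture, the coefficient c_dr3, and the batch layout are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of adding a DR3-style explicit
# regularizer to a TD objective. QNetwork, c_dr3 and the batch layout are
# assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Q-function with an exposed penultimate feature layer phi(s, a)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def features(self, obs, act):
        return self.encoder(torch.cat([obs, act], dim=-1))

    def forward(self, obs, act):
        return self.head(self.features(obs, act)).squeeze(-1)

def td_loss_with_dr3(q_net, target_net, batch, gamma=0.99, c_dr3=1e-3):
    """TD error plus a penalty on the feature dot product across the backup."""
    obs, act, rew, next_obs, next_act, done = batch
    with torch.no_grad():
        target = rew + gamma * (1.0 - done) * target_net(next_obs, next_act)
    phi = q_net.features(obs, act)                 # features of (s, a)
    phi_next = q_net.features(next_obs, next_act)  # features of (s', a') in the target
    q_pred = q_net.head(phi).squeeze(-1)
    td = F.mse_loss(q_pred, target)
    # DR3-style term: discourage co-adaptation ("aliasing") of representations
    # on either side of the Bellman backup.
    dr3 = (phi * phi_next).sum(dim=-1).mean()
    return td + c_dr3 * dr3
```

In the paper DR3 is layered on top of existing offline RL methods; in a sketch like this it simply contributes one extra scalar term to whatever critic loss is already being minimized.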
Related papers
- The Role of Deep Learning Regularizations on Actors in Offline RL [1.2744523252873352]
Regularization techniques, such as dropout, layer normalization, or weight decay, are widely adopted in the construction of modern artificial neural networks.
In the domain of Reinforcement Learning (RL), the application of these techniques has been limited.
We show that applying standard regularization techniques to actor networks in offline RL actor-critic algorithms yields improvements of 6% on average.
arXiv Detail & Related papers (2024-09-11T20:35:29Z)
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- On Reducing Undesirable Behavior in Deep Reinforcement Learning Models [0.0]
We propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software.
Our framework can assist in providing engineers with a comprehensible characterization of such undesirable behavior.
arXiv Detail & Related papers (2023-09-06T09:47:36Z)
- An Empirical Study of Implicit Regularization in Deep Offline RL [44.62587507925864]
We study the relation between effective rank and performance on three offline RL datasets (a sketch of the effective-rank measure appears after this list).
We identify three phases of learning that explain the impact of implicit regularization on the learning dynamics.
arXiv Detail & Related papers (2022-07-05T15:07:31Z)
- Stabilizing Off-Policy Deep Reinforcement Learning from Pixels [9.998078491879145]
Off-policy reinforcement learning from pixel observations is notoriously unstable.
We show that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards.
We propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting.
arXiv Detail & Related papers (2022-07-03T08:52:40Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
- Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z)
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
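Several of the related papers above diagnose feature degeneracy with an "effective rank" measure: the smallest number of singular values of the learned feature matrix needed to capture most of their total mass, with collapsed or heavily aliased features yielding a low value. Below is a minimal sketch under that common convention; the threshold delta = 0.01 and the helper name effective_rank are illustrative assumptions, not a fixed standard.

```python
# Minimal sketch of an "effective rank" diagnostic for learned features:
# the smallest k whose top-k singular values capture (1 - delta) of the
# total singular-value mass. delta = 0.01 is an assumed, commonly used value.
import numpy as np

def effective_rank(features: np.ndarray, delta: float = 0.01) -> int:
    """features: [num_state_actions, feature_dim] matrix of penultimate-layer outputs."""
    singular_values = np.linalg.svd(features, compute_uv=False)  # descending order
    cumulative = np.cumsum(singular_values) / singular_values.sum()
    # Index of the first cumulative fraction reaching (1 - delta), counted from 1.
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

# Example: a nearly rank-collapsed feature matrix yields a small effective rank.
phi = np.outer(np.random.randn(1000), np.random.randn(64))  # rank-1 matrix
phi += 1e-3 * np.random.randn(1000, 64)                     # small perturbation
print(effective_rank(phi))  # close to 1
```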