Related papers: DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization

DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization

URL: http://arxiv.org/abs/2112.04716v1
Date: Thu, 9 Dec 2021 06:01:01 GMT
Title: DR3: Value-Based Deep Reinforcement Learning Requires Explicit Regularization
Authors: Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George Tucker, Sergey Levine
Abstract summary: We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions. We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer.
Score: 125.5448293005647
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite overparameterization, deep networks trained via supervised learning are easy to optimize and exhibit excellent generalization. One hypothesis to explain this is that overparameterized deep networks enjoy the benefits of implicit regularization induced by stochastic gradient descent, which favors parsimonious solutions that generalize well on test inputs. It is reasonable to surmise that deep reinforcement learning (RL) methods could also benefit from this effect. In this paper, we discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting, leading to poor generalization and degenerate feature representations. Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions with excessive "aliasing", in stark contrast to the supervised learning case. We back up these findings empirically, showing that feature representations learned by a deep network value function trained via bootstrapping can indeed become degenerate, aliasing the representations for state-action pairs that appear on either side of the Bellman backup. To address this issue, we derive the form of this implicit regularizer and, inspired by this derivation, propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer. When combined with existing offline RL methods, DR3 substantially improves performance and stability, alleviating unlearning in Atari 2600 games, D4RL domains and robotic manipulation from images.

Related papers

SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch token they extract. We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
IMEX-Reg: Implicit-Explicit Regularization in the Function Space for Continual Learning [17.236861687708096]
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low buffer regimes.
arXiv Detail & Related papers (2024-04-28T12:25:09Z)
On Reducing Undesirable Behavior in Deep Reinforcement Learning Models [0.0]
We propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software. Our framework can assist in providing engineers with a comprehensible characterization of such undesirable behavior.
arXiv Detail & Related papers (2023-09-06T09:47:36Z)
An Empirical Study of Implicit Regularization in Deep Offline RL [44.62587507925864]
We study the relation between effective rank and performance on three offline RL datasets. We identify three phases of learning that explain the impact of implicit regularization on the learning dynamics.
arXiv Detail & Related papers (2022-07-05T15:07:31Z)
Stabilizing Off-Policy Deep Reinforcement Learning from Pixels [9.998078491879145]
Off-policy reinforcement learning from pixel observations is notoriously unstable. We show that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards. We propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting.
arXiv Detail & Related papers (2022-07-03T08:52:40Z)
Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose negative sampling strategy for training the RL component and combine it with supervised sequential learning. Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case. We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network. We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z)
Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training. We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.