DR3: Value-Based Deep Reinforcement Learning Requires Explicit
Regularization
- URL: http://arxiv.org/abs/2112.04716v1
- Date: Thu, 9 Dec 2021 06:01:01 GMT
- Title: DR3: Value-Based Deep Reinforcement Learning Requires Explicit
Regularization
- Authors: Aviral Kumar, Rishabh Agarwal, Tengyu Ma, Aaron Courville, George
Tucker, Sergey Levine
- Abstract summary: We discuss how the implicit regularization effect of SGD seen in supervised learning could in fact be harmful in the offline deep RL setting.
Our theoretical analysis shows that when existing models of implicit regularization are applied to temporal difference learning, the resulting derived regularizer favors degenerate solutions.
We propose a simple and effective explicit regularizer, called DR3, that counteracts the undesirable effects of this implicit regularizer.
- Score: 125.5448293005647
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite overparameterization, deep networks trained via supervised learning
are easy to optimize and exhibit excellent generalization. One hypothesis to
explain this is that overparameterized deep networks enjoy the benefits of
implicit regularization induced by stochastic gradient descent, which favors
parsimonious solutions that generalize well on test inputs. It is reasonable to
surmise that deep reinforcement learning (RL) methods could also benefit from
this effect. In this paper, we discuss how the implicit regularization effect
of SGD seen in supervised learning could in fact be harmful in the offline deep
RL setting, leading to poor generalization and degenerate feature
representations. Our theoretical analysis shows that when existing models of
implicit regularization are applied to temporal difference learning, the
resulting derived regularizer favors degenerate solutions with excessive
"aliasing", in stark contrast to the supervised learning case. We back up these
findings empirically, showing that feature representations learned by a deep
network value function trained via bootstrapping can indeed become degenerate,
aliasing the representations for state-action pairs that appear on either side
of the Bellman backup. To address this issue, we derive the form of this
implicit regularizer and, inspired by this derivation, propose a simple and
effective explicit regularizer, called DR3, that counteracts the undesirable
effects of this implicit regularizer. When combined with existing offline RL
methods, DR3 substantially improves performance and stability, alleviating
unlearning in Atari 2600 games, D4RL domains and robotic manipulation from
images.
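The abstract does not state the form of the DR3 regularizer, so the following is only a minimal sketch under a stated assumption: that the penalty is the dot product between the last-layer feature vectors of the two state-action pairs on either side of the Bellman backup, (s, a) and (s', a'), added to an ordinary squared TD error. The function names, the coefficient `c0`, and the toy arrays are hypothetical and chosen for illustration.

```python
import numpy as np


def dr3_penalty(phi_sa, phi_next):
    """Mean dot product between features of (s, a) and (s', a').

    Assumption: a large positive value indicates that the representations
    of consecutive state-action pairs are aligned ("aliased").
    Both inputs have shape (batch, feature_dim).
    """
    return float(np.mean(np.sum(phi_sa * phi_next, axis=1)))


def td_loss_with_dr3(q_sa, q_next, rewards, phi_sa, phi_next,
                     gamma=0.99, c0=0.03):
    """Squared TD error plus c0 times the feature dot-product penalty.

    c0 is a hypothetical weighting coefficient. In a real trainer the
    bootstrapped target would be excluded from gradient computation.
    """
    target = rewards + gamma * q_next
    td_error = np.mean((q_sa - target) ** 2)
    return float(td_error + c0 * dr3_penalty(phi_sa, phi_next))


# Toy batch of two transitions with 2-dimensional features.
phi_sa = np.array([[1.0, 0.0], [0.0, 2.0]])
phi_aligned = np.array([[1.0, 0.0], [0.0, 2.0]])     # same directions
phi_orthogonal = np.array([[0.0, 1.0], [3.0, 0.0]])  # orthogonal directions

print(dr3_penalty(phi_sa, phi_aligned))
print(dr3_penalty(phi_sa, phi_orthogonal))
```

In this toy batch the aligned features give a larger penalty than the orthogonal ones, so minimizing the combined loss pushes the features of consecutive state-action pairs apart, the opposite direction from the aliasing the abstract describes.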
Related papers
- IR$^3$: Contrastive Inverse Reinforcement Learning for Interpretable Detection and Mitigation of Reward Hacking [67.20568716300272]
Reinforcement Learning from Human Feedback (RLHF) enables powerful LLM alignment but can introduce reward hacking.
We introduce IR3 (Interpretable Reward Reconstruction and Rectification), a framework that reverse-engineers, interprets, and surgically repairs the implicit objectives driving RLHF-tuned models.
We show that IR3 achieves 0.89 correlation with ground-truth rewards, identifies hacking features with over 90% precision, and significantly reduces hacking behaviors while maintaining capabilities within 3% of the original model.
arXiv Detail & Related papers (2026-02-23T01:14:53Z)
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning [88.42566960813438]
CalibRL is a hybrid-policy RLVR framework that supports controllable exploration with expert guidance.
CalibRL increases policy entropy in a guided manner and clarifies the target distribution.
Experiments across eight benchmarks, including both in-domain and out-of-domain settings, demonstrate consistent improvements.
arXiv Detail & Related papers (2026-02-22T07:23:36Z)
- Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models [13.32858759983739]
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs.
Existing inference-time interventions to mitigate this issue present a challenging trade-off.
We present Residual-Update Directed DEcoding Regulation (RUDDER), a framework that steers LVLMs towards visually-grounded generation.
arXiv Detail & Related papers (2025-11-13T13:29:38Z)
- SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning [48.41770886055744]
Federated Learning has seen an increased deployment in real-world scenarios recently.
The introduction of the so-called gradient inversion attacks has challenged its privacy-preserving properties.
We introduce SPEAR, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations.
Our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation.
arXiv Detail & Related papers (2025-10-28T09:06:19Z)
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch token they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- IMEX-Reg: Implicit-Explicit Regularization in the Function Space for Continual Learning [17.236861687708096]
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge.
Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low buffer regimes.
arXiv Detail & Related papers (2024-04-28T12:25:09Z)
- On Reducing Undesirable Behavior in Deep Reinforcement Learning Models [0.0]
We propose a novel framework aimed at drastically reducing the undesirable behavior of DRL-based software.
Our framework can assist in providing engineers with a comprehensible characterization of such undesirable behavior.
arXiv Detail & Related papers (2023-09-06T09:47:36Z)
- An Empirical Study of Implicit Regularization in Deep Offline RL [44.62587507925864]
We study the relation between effective rank and performance on three offline RL datasets.
We identify three phases of learning that explain the impact of implicit regularization on the learning dynamics.
arXiv Detail & Related papers (2022-07-05T15:07:31Z)
- Stabilizing Off-Policy Deep Reinforcement Learning from Pixels [9.998078491879145]
Off-policy reinforcement learning from pixel observations is notoriously unstable.
We show that these instabilities arise from performing temporal-difference learning with a convolutional encoder and low-magnitude rewards.
We propose A-LIX, a method providing adaptive regularization to the encoder's gradients that explicitly prevents the occurrence of catastrophic self-overfitting.
arXiv Detail & Related papers (2022-07-03T08:52:40Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- Stochastic Training is Not Necessary for Generalization [57.04880404584737]
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks.
In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD.
arXiv Detail & Related papers (2021-09-29T00:50:00Z)
- Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z)
- Overfitting in adversarially robust deep learning [86.11788847990783]
We show that overfitting to the training set does in fact harm robust performance to a very large degree in adversarially robust training.
We also show that effects such as the double descent curve do still occur in adversarially trained models, yet fail to explain the observed overfitting.
arXiv Detail & Related papers (2020-02-26T15:40:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.