Correcting Momentum in Temporal Difference Learning
- URL: http://arxiv.org/abs/2106.03955v1
- Date: Mon, 7 Jun 2021 20:41:15 GMT
- Title: Correcting Momentum in Temporal Difference Learning
- Authors: Emmanuel Bengio, Joelle Pineau, Doina Precup
- Abstract summary: We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
- Score: 95.62766731469671
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common optimization tool used in deep reinforcement learning is momentum,
which consists in accumulating and discounting past gradients, reapplying them
at each iteration. We argue that, unlike in supervised learning, momentum in
Temporal Difference (TD) learning accumulates gradients that become doubly
stale: not only does the gradient of the loss change due to parameter updates,
the loss itself changes due to bootstrapping. We first show that this
phenomenon exists, and then propose a first-order correction term to momentum.
We show that this correction term improves sample efficiency in policy
evaluation by correcting target value drift. An important insight of this work
is that deep RL methods are not always best served by directly importing
techniques from the supervised setting.
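To make the staleness argument concrete, below is a minimal sketch of semi-gradient TD(0) policy evaluation with classical momentum on a toy Markov chain; the chain, feature map, and variable names are illustrative assumptions, and the paper's first-order correction term is not reproduced here.

```python
import numpy as np

# Toy 3-state Markov chain under a fixed policy (illustrative only, not from the paper).
rng = np.random.default_rng(0)
P = np.array([[0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8],
              [0.8, 0.1, 0.1]])   # transition probabilities under the policy
R = np.array([0.0, 0.5, 1.0])     # expected reward for leaving each state
gamma = 0.9
phi = np.eye(3)                   # tabular features, so V(s) = w[s]

w = np.zeros(3)                   # value-function parameters
m = np.zeros(3)                   # momentum buffer
lr, beta = 0.05, 0.9

s = 0
for t in range(5000):
    s_next = rng.choice(3, p=P[s])
    r = R[s]

    # Bootstrapped TD target: it depends on the current parameters w,
    # so it drifts every time w is updated.
    target = r + gamma * (phi[s_next] @ w)

    # Semi-gradient of 0.5 * (target - V(s))^2, with the target held fixed.
    td_error = target - phi[s] @ w
    grad = -td_error * phi[s]

    # Momentum accumulates past gradients. In TD learning they are doubly
    # stale: each was computed under old parameters AND an old bootstrap target.
    m = beta * m + grad
    w = w - lr * m
    s = s_next

print("estimated values:", w)
```

Because the bootstrap target above moves with w, the gradients accumulated in the momentum buffer refer both to outdated parameters and to outdated targets; this is the double staleness the abstract describes, and the paper's proposal is a first-order correction term added to this momentum update.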
Related papers
- An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient at each update step of the model.
arXiv Detail & Related papers (2024-07-30T16:30:09Z)
- Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple re-parameterization, which we call Normalize-and-Project.
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
- ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update [43.91666113724066]
We investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL).
DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning.
We find that there exist two gradient terms when learning the value function with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state).
arXiv Detail & Related papers (2024-02-01T05:30:51Z)
- DPSUR: Accelerating Differentially Private Stochastic Gradient Descent Using Selective Update and Release [29.765896801370612]
This paper proposes a differentially private training framework based on Selective Update and Release (DPSUR).
The main challenges lie in two aspects: privacy concerns, and the gradient selection strategy for model updates.
Experiments conducted on MNIST, FMNIST, CIFAR-10, and IMDB datasets show that DPSUR significantly outperforms previous works in terms of convergence speed.
arXiv Detail & Related papers (2023-11-23T15:19:30Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Implicit Under-Parameterization Inhibits Data-Efficient Deep Reinforcement Learning [97.28695683236981]
More gradient updates decrease the expressivity of the current value network.
We demonstrate this phenomenon on Atari and Gym benchmarks, in both offline and online RL settings.
arXiv Detail & Related papers (2020-10-27T17:55:16Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory of deep neural networks.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.