ODICE: Revealing the Mystery of Distribution Correction Estimation via
Orthogonal-gradient Update
- URL: http://arxiv.org/abs/2402.00348v1
- Date: Thu, 1 Feb 2024 05:30:51 GMT
- Title: ODICE: Revealing the Mystery of Distribution Correction Estimation via
Orthogonal-gradient Update
- Authors: Liyuan Mao, Haoran Xu, Weinan Zhang, Xianyuan Zhan
- Abstract summary: We investigate DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL).
DICE-based methods impose a state-action-level behavior constraint, which is an ideal choice for offline learning.
We find that there exist two gradient terms when learning the value function with a true-gradient update: the forward gradient (taken on the current state) and the backward gradient (taken on the next state).
- Score: 43.91666113724066
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this study, we investigate DIstribution Correction Estimation (DICE)
methods, an important line of work in offline reinforcement learning (RL) and
imitation learning (IL). DICE-based methods impose a state-action-level
behavior constraint, which is an ideal choice for offline learning. However,
they typically perform much worse than current state-of-the-art (SOTA) methods
that solely use an action-level behavior constraint. After revisiting
DICE-based methods, we find that there exist two gradient terms when learning
the value function with a true-gradient update: the forward gradient (taken on
the current state) and the backward gradient (taken on the next state). Using
the forward gradient bears a large similarity to many offline RL methods and
can thus be regarded as applying an action-level constraint. However, directly
adding the backward gradient may degenerate or cancel out its effect if the
two gradients have conflicting directions. To resolve this issue, we propose a
simple yet effective modification that projects the backward gradient onto the
normal plane of the forward gradient, resulting in the orthogonal-gradient
update, a new learning rule for DICE-based methods. We conduct thorough
theoretical analyses and find that the projected backward gradient brings
state-level behavior regularization, which reveals the mystery of DICE-based
methods: the value-learning objective does try to impose a state-action-level
constraint, but it needs to be used in a corrected way. Through toy examples
and extensive experiments on complex offline RL and IL tasks, we demonstrate
that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve
SOTA performance and great robustness.
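For intuition, the sketch below shows the projection step described in the abstract: removing from the backward gradient the component that lies along the forward gradient before the two are combined. This is a minimal illustration only; the function name, the `eta` weighting, and the use of plain NumPy vectors are assumptions for exposition, not the paper's actual implementation, which applies the update to value-function parameter gradients inside a DICE objective.

```python
import numpy as np

def orthogonal_gradient_update(g_forward, g_backward, eta=1.0):
    """Combine forward and backward gradients via an orthogonal projection.

    The backward gradient is projected onto the normal plane of the forward
    gradient, so any component that conflicts with (or duplicates) the forward
    direction is removed before the two gradients are summed.
    """
    g_f = np.asarray(g_forward, dtype=float)
    g_b = np.asarray(g_backward, dtype=float)
    # g_b_perp = g_b - (<g_b, g_f> / ||g_f||^2) * g_f
    denom = np.dot(g_f, g_f)
    if denom > 0.0:
        g_b_perp = g_b - (np.dot(g_b, g_f) / denom) * g_f
    else:
        g_b_perp = g_b
    # eta is a hypothetical weight on the projected backward gradient, which
    # (per the paper's analysis) contributes state-level regularization.
    return g_f + eta * g_b_perp
```

Because the projected term is orthogonal to the forward gradient by construction, it can no longer cancel the forward (action-level) component, while still contributing its own regularizing direction.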
Related papers
- An Effective Dynamic Gradient Calibration Method for Continual Learning [11.555822066922508]
Continual learning (CL) is a fundamental topic in machine learning, where the goal is to train a model with continuously incoming data and tasks.
Due to the memory limit, we cannot store all the historical data, and therefore confront the "catastrophic forgetting" problem.
We develop an effective algorithm to calibrate the gradient in each updating step of the model.
arXiv Detail & Related papers (2024-07-30T16:30:09Z)
- One-Step Forward and Backtrack: Overcoming Zig-Zagging in Loss-Aware Quantization Training [12.400950982075948]
Weight quantization is an effective technique to compress deep neural networks for their deployment on edge devices with limited resources.
Traditional loss-aware quantization methods commonly use the quantized gradient to replace the full-precision gradient.
This paper proposes a one-step forward and backtrack way for loss-aware quantization to get more accurate and stable gradient direction.
arXiv Detail & Related papers (2024-01-30T05:42:54Z)
- Class Gradient Projection For Continual Learning [99.105266615448]
Catastrophic forgetting is one of the most critical challenges in Continual Learning (CL).
We propose Class Gradient Projection (CGP), which calculates the gradient subspace from individual classes rather than tasks.
arXiv Detail & Related papers (2023-11-25T02:45:56Z)
- A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z)
- Gradient Correction beyond Gradient Descent [63.33439072360198]
Gradient correction is apparently the most crucial aspect for the training of a neural network.
We introduce a framework (GCGD) to perform gradient correction.
Experiment results show that our gradient correction framework can effectively improve the gradient quality to reduce training epochs by approximately 20% and also improve the network performance.
arXiv Detail & Related papers (2022-03-16T01:42:25Z)
- On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
arXiv Detail & Related papers (2021-11-09T14:40:24Z)
- Correcting Momentum in Temporal Difference Learning [95.62766731469671]
We argue that momentum in Temporal Difference (TD) learning accumulates gradients that become doubly stale.
We show that this phenomenon exists, and then propose a first-order correction term to momentum.
An important insight of this work is that deep RL methods are not always best served by directly importing techniques from the supervised setting.
arXiv Detail & Related papers (2021-06-07T20:41:15Z)
- SSGD: A safe and efficient method of gradient descent [0.5099811144731619]
The gradient descent method plays an important role in solving various optimization problems.
We propose a super gradient descent approach that updates parameters by concealing the length of the gradient.
Our algorithm can defend against attacks on the gradient.
arXiv Detail & Related papers (2020-12-03T17:09:20Z)
- Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization [16.02377434191239]
We propose an accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect.
We prove that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.
The ADL is shown to outperform several state-of-the-arts in the classification tasks, and is the fastest among the compared methods.
arXiv Detail & Related papers (2020-12-03T11:52:55Z)
- Sample Efficient Reinforcement Learning with REINFORCE [10.884278019498588]
We consider classical policy gradient methods and the widely-used REINFORCE estimation procedure.
By controlling the number of "bad" episodes, we establish an anytime sub-linear high-probability regret bound as well as almost sure global convergence of the average regret with an asymptotically sub-linear rate.
These provide the first set of global convergence and sample efficiency results for the well-known REINFORCE algorithm and contribute to a better understanding of its performance in practice.
arXiv Detail & Related papers (2020-10-22T01:02:55Z)