Learning Expected Emphatic Traces for Deep RL
- URL: http://arxiv.org/abs/2107.05405v1
- Date: Mon, 12 Jul 2021 13:14:03 GMT
- Title: Learning Expected Emphatic Traces for Deep RL
- Authors: Ray Jiang, Shangtong Zhang, Veronica Chelu, Adam White, Hado van
Hasselt
- Abstract summary: Off-policy sampling and experience replay are key for improving sample efficiency and scaling model-free temporal difference learning methods.
We develop a multi-step emphatic weighting that can be combined with replay, and a time-reversed $n$-step TD learning algorithm to learn the required emphatic weighting.
- Score: 32.984880782688535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy sampling and experience replay are key for improving sample
efficiency and scaling model-free temporal difference learning methods. When
combined with function approximation, such as neural networks, these techniques
form the so-called deadly triad, which is potentially unstable. Recently, it has been
shown that stability and good performance at scale can be achieved by combining
emphatic weightings and multi-step updates. This approach, however, is
generally limited to sampling complete trajectories in sequential order, so that
the required emphatic weighting can be computed. In this paper we investigate how to combine
emphatic weightings with non-sequential, off-line data sampled from a replay
buffer. We develop a multi-step emphatic weighting that can be combined with
replay, and a time-reversed $n$-step TD learning algorithm to learn the
required emphatic weighting. We show that these state weightings reduce
variance compared with prior approaches, while providing convergence
guarantees. We tested the approach at scale on Atari 2600 video games, and
observed that the new X-ETD($n$) agent improved over baseline agents,
highlighting both the scalability and broad applicability of our approach.
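The abstract only outlines the method, so the following is a rough, minimal sketch of the general idea with linear function approximation: an expected followon/emphatic trace is learned with a time-reversed $n$-step TD target and then used to emphasise replayed $n$-step value updates. The function name `update_from_segment`, the constant interest $i(s)=1$, and the naive product-of-ratios off-policy correction are assumptions of the sketch, not details of the paper's X-ETD($n$) agent.
```python
import numpy as np

def update_from_segment(phi, rewards, rhos, w, u, n=3, gamma=0.99,
                        alpha_v=0.1, alpha_m=0.1, interest=1.0):
    """Illustrative sketch (not the paper's algorithm): learn an expected
    emphatic trace m(s) = u . phi(s) with a time-reversed n-step TD target,
    and use it to weight replayed n-step TD updates for v(s) = w . phi(s).

    phi:     (T+1, d) feature vectors of S_0..S_T from a stored segment
    rewards: (T,)     rewards R_1..R_T
    rhos:    (T,)     importance ratios pi(A_t|S_t)/mu(A_t|S_t), t = 0..T-1
    w, u:    (d,)     float parameter vectors, updated in place
    """
    T = len(rewards)
    for t in range(T - n + 1):
        # --- time-reversed n-step target for the expected trace ---
        # unrolls F_t = i(S_t) + gamma * rho_{t-1} * F_{t-1} backwards,
        # bootstrapping on the learned trace at S_{t-n} when it exists
        g_m, disc = interest, 1.0
        for k in range(1, n):
            if t - k < 0:          # segment start: treat the earlier trace as zero
                break
            disc *= gamma * rhos[t - k]
            g_m += disc * interest
        if t - n >= 0:
            disc *= gamma * rhos[t - n]
            g_m += disc * float(u @ phi[t - n])
        u += alpha_m * (g_m - u @ phi[t]) * phi[t]

        # --- ordinary forward n-step TD target for the value function ---
        g_v, disc, corr = 0.0, 1.0, 1.0
        for k in range(n):
            corr *= rhos[t + k]    # naive product-of-ratios correction
            g_v += disc * rewards[t + k]
            disc *= gamma
        g_v += disc * float(w @ phi[t + n])
        delta = g_v - float(w @ phi[t])

        # emphasise the replayed update by the learned expected trace
        emphasis = max(float(u @ phi[t]), 0.0)
        w += alpha_v * emphasis * corr * delta * phi[t]
    return w, u
```
Under these assumptions, `w` and `u` are NumPy parameter vectors for the value function and the expected trace; the paper's agent instead uses neural networks, mini-batches sampled from replay, and its own multi-step emphatic weighting.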
Related papers
- Stabilizing Linear Passive-Aggressive Online Learning with Weighted Reservoir Sampling [46.01254613933967]
Online learning methods are still highly effective for high-dimensional streaming data, out-of-core processing, and other throughput-sensitive applications.
Many such algorithms rely on fast adaptation to individual errors as a key to their convergence.
While such algorithms enjoy low theoretical regret, in real-world deployment they can be sensitive to individual outliers that cause the algorithm to over-correct.
arXiv Detail & Related papers (2024-10-31T03:35:48Z)
- Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes replaying the data of previously experienced tasks when learning new tasks.
However, this is often infeasible in practice due to memory constraints or data privacy issues.
As an alternative, data-free replay methods synthesize samples by inverting the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- Understanding the effect of varying amounts of replay per step [0.0]
We study the effect of varying amounts of replay per step with a well-known model-free algorithm, Deep Q-Network (DQN), in the Mountain Car environment.
arXiv Detail & Related papers (2023-02-20T20:54:11Z)
- A Data-Centric Approach for Improving Adversarial Training Through the Lens of Out-of-Distribution Detection [0.4893345190925178]
We propose detecting and removing hard samples directly from the training procedure rather than applying complicated algorithms to mitigate their effects.
Our results on the SVHN and CIFAR-10 datasets show the effectiveness of this method in improving adversarial training without adding much computational cost.
arXiv Detail & Related papers (2023-01-25T08:13:50Z)
- Look Back When Surprised: Stabilizing Reverse Experience Replay for Neural Approximation [7.6146285961466]
We consider the recently developed and theoretically rigorous reverse experience replay (RER).
We show via experiments that it performs better than techniques like prioritized experience replay (PER) on various tasks.
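Since the entry only names the technique, here is a minimal sketch of the generic reverse-replay idea (not the specific algorithm analysed in that paper), assuming the buffer is a plain list of transitions stored in the order they were experienced:
```python
import random

def reverse_replay(buffer, span, batch_size):
    """Generic reverse-experience-replay sampler (illustrative only): pick a
    random contiguous window of `span` stored transitions and replay it from
    newest to oldest in mini-batches, so value information can propagate
    backwards along the stored trajectory."""
    assert len(buffer) >= span
    start = random.randint(0, len(buffer) - span)
    window = buffer[start:start + span]
    for end in range(span, 0, -batch_size):
        yield window[max(0, end - batch_size):end][::-1]  # newest to oldest
```
Each yielded mini-batch would then feed an ordinary TD or Q-learning update.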
arXiv Detail & Related papers (2022-06-07T10:42:02Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with IPS-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
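For context, IPS weighting rescales each logged example by the inverse of its logging propensity, so small propensities directly inflate the gradient variance mentioned above. The helper below is a generic illustration of an IPS-weighted gradient estimate, not the CounterSample algorithm:
```python
import numpy as np

def ips_weighted_gradient(grads, propensities):
    """Generic inverse-propensity-scored (IPS) gradient estimate, shown only to
    illustrate the variance issue (not CounterSample): `grads` is an (n, d)
    array of per-example gradients, `propensities` the (n,) logging
    probabilities; rare logged actions get weight 1/p and dominate."""
    weights = 1.0 / np.asarray(propensities, dtype=float)
    return (np.asarray(grads, dtype=float) * weights[:, None]).mean(axis=0)
```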
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
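The DisCor entry above only states that transitions are re-weighted; as a heavily simplified illustration of that general idea (not DisCor's actual weighting, which is defined in that paper), one could down-weight replayed transitions whose bootstrap targets are estimated to carry large error:
```python
import numpy as np

def error_aware_weights(target_error_estimates, temperature=1.0):
    """Simplified sketch of error-aware transition re-weighting (not DisCor's
    exact formula): transitions whose bootstrap targets are estimated to be
    less accurate receive exponentially smaller training weight."""
    w = np.exp(-np.asarray(target_error_estimates, dtype=float) / temperature)
    return w / w.sum()  # normalised weights for a replayed mini-batch
```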
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.