Experience Replay with Likelihood-free Importance Weights
- URL: http://arxiv.org/abs/2006.13169v1
- Date: Tue, 23 Jun 2020 17:17:44 GMT
- Title: Experience Replay with Likelihood-free Importance Weights
- Authors: Samarth Sinha and Jiaming Song and Animesh Garg and Stefano Ermon
- Abstract summary: We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically on two competitive methods, Soft Actor Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3).
- Score: 123.52005591531194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The use of past experiences to accelerate temporal difference (TD) learning
of value functions, or experience replay, is a key component in deep
reinforcement learning. Prioritization or reweighting of important experiences
has been shown to improve the performance of TD learning algorithms. In this work, we
propose to reweight experiences based on their likelihood under the stationary
distribution of the current policy. Using the corresponding reweighted TD
objective, we implicitly encourage small approximation errors on the value
function over frequently encountered states. We use a likelihood-free density
ratio estimator over the replay buffer to assign the prioritization weights. We
apply the proposed approach empirically on two competitive methods, Soft Actor
Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3) -- over
a suite of OpenAI gym tasks and achieve superior sample complexity compared to
other baseline approaches.
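The reweighting described in the abstract can be made concrete with a classifier-based ("likelihood-free") density-ratio estimator: a small network learns to distinguish recent, approximately on-policy transitions from replay-buffer transitions, and its logit estimates log d_pi(s,a) - log d_replay(s,a), which then reweights the squared TD error. The PyTorch sketch below is only an illustration of this idea; the RatioEstimator class, the "recent buffer" assumption, and the batch self-normalization are choices made for the example, not the authors' implementation.

```python
# Illustrative sketch only (not the paper's code): a logistic classifier is
# trained to tell "recent" transitions (treated as samples from the current
# policy's stationary distribution d_pi) apart from replay-buffer transitions
# (samples from d_replay); exp(logit) then estimates d_pi / d_replay, which
# reweights the squared TD error.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioEstimator(nn.Module):  # hypothetical name, for illustration
    def __init__(self, sa_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sa_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, sa):
        return self.net(sa).squeeze(-1)  # unnormalized logit

def ratio_classifier_loss(estimator, sa_recent, sa_replay):
    # Binary cross-entropy with recent samples labeled 1 and replay samples 0.
    # At the optimum, sigmoid(logit) = d_pi / (d_pi + d_replay), so the logit
    # equals log d_pi - log d_replay.
    logits = torch.cat([estimator(sa_recent), estimator(sa_replay)])
    labels = torch.cat([torch.ones(len(sa_recent)), torch.zeros(len(sa_replay))])
    return F.binary_cross_entropy_with_logits(logits, labels)

def reweighted_td_loss(q_net, q_target, estimator, batch, gamma=0.99):
    # batch: (s, a, r, s2, a2, done); r and done are 1-D tensors of shape [batch],
    # a2 is the next action drawn from the current policy.
    s, a, r, s2, a2, done = batch
    sa = torch.cat([s, a], dim=-1)
    with torch.no_grad():
        w = torch.exp(estimator(sa))   # estimated density ratio d_pi / d_replay
        w = w / w.mean()               # self-normalize within the batch
        target = r + gamma * (1.0 - done) * q_target(torch.cat([s2, a2], dim=-1)).squeeze(-1)
    td_error = q_net(sa).squeeze(-1) - target
    return (w * td_error.pow(2)).mean()
```

In practice such an estimator would be refit periodically as the policy changes, and the batch self-normalization keeps the effective learning rate stable when the raw ratios vary widely.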
Related papers
- ROER: Regularized Optimal Experience Replay [34.462315999611256]
Prioritized experience replay (PER) reweights experiences by the temporal difference (TD) error (a minimal sketch of this prioritization mechanism appears after this list).
We show the connections between experience prioritization and occupancy optimization.
Regularized optimal experience replay (ROER) achieves a noticeable improvement on the difficult AntMaze environment.
arXiv Detail & Related papers (2024-07-04T15:14:57Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for
Incremental Learning [100.7407460674153]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.
To mitigate the problem, a line of methods proposes to replay data from previously learned tasks when learning new ones.
However, this is often impractical given memory constraints or data privacy concerns.
As a replacement, data-free data replay methods have been proposed, which invert samples from the classification model.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - Directly Attention Loss Adjusted Prioritized Experience Replay [0.07366405857677226]
Prioritized Experience Replay (PER) enables the model to learn more from relatively important samples by artificially changing their sampling frequencies.
DALAP is proposed, which can directly quantify the extent of the distribution shift through a Parallel Self-Attention network.
arXiv Detail & Related papers (2023-11-24T10:14:05Z) - Attention Loss Adjusted Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) is a technique in deep reinforcement learning that selects more informative experience samples to improve the training rate of the neural network.
The non-uniform sampling used in PER inevitably shifts the state-action distribution and introduces estimation error into the Q-value function.
An Attention Loss Adjusted Prioritized (ALAP) Experience Replay algorithm is proposed, which integrates an improved Self-Attention network with a Double-Sampling mechanism.
arXiv Detail & Related papers (2023-09-13T02:49:32Z) - Safe and Robust Experience Sharing for Deterministic Policy Gradient
Algorithms [0.0]
We introduce a simple yet effective experience sharing mechanism for deterministic policies in continuous action domains.
We equip the algorithm with a novel off-policy correction technique that requires no action probability estimates.
We test the effectiveness of our method on challenging OpenAI Gym continuous control tasks and conclude that it achieves safe experience sharing across multiple agents.
arXiv Detail & Related papers (2022-07-27T11:10:50Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z) - Replay For Safety [51.11953997546418]
In experience replay, past transitions are stored in a memory buffer and re-used during learning.
We show that using an appropriate biased sampling scheme can allow us to achieve a safe policy.
arXiv Detail & Related papers (2021-12-08T11:10:57Z) - Revisiting Fundamentals of Experience Replay [91.24213515992595]
We present a systematic and extensive analysis of experience replay in Q-learning methods.
We focus on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected.
arXiv Detail & Related papers (2020-07-13T21:22:17Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas from the propensity estimation literature can be used to importance-sample transitions from the replay buffer and update the policy so as to prevent performance deterioration.
arXiv Detail & Related papers (2020-06-26T20:21:12Z)
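Several of the entries above (ROER, DALAP, ALAP) build on TD-error-based prioritization, so the following minimal NumPy sketch of that baseline mechanism is included for reference: proportional priorities p_i = (|delta_i| + eps)^alpha, sampling probabilities proportional to p_i, and importance weights (N * P(i))^(-beta) that correct the bias introduced by non-uniform sampling. It illustrates standard PER, not the implementation of any listed paper.

```python
# Illustrative sketch of TD-error-based prioritized replay (standard PER).
import numpy as np

class PrioritizedBuffer:  # hypothetical helper, for illustration only
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition, td_error):
        # New transitions get priority (|delta| + eps)^alpha.
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.priorities[self.pos] = (abs(td_error) + self.eps) ** self.alpha
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        n = len(self.data)
        probs = self.priorities[:n] / self.priorities[:n].sum()
        idx = np.random.choice(n, size=batch_size, p=probs)
        # Importance weights correct the bias of non-uniform sampling.
        weights = (n * probs[idx]) ** (-beta)
        weights /= weights.max()  # normalize to at most 1 for stability
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        # After a learning step, refresh priorities with the new TD errors.
        self.priorities[idx] = (np.abs(td_errors) + self.eps) ** self.alpha
```

The sampled weights multiply the per-sample TD loss, and update_priorities is called with the recomputed TD errors after each gradient step.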