Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms
via Batch Prioritized Experience Replay
- URL: http://arxiv.org/abs/2111.01865v1
- Date: Tue, 2 Nov 2021 19:51:59 GMT
- Title: Off-Policy Correction for Deep Deterministic Policy Gradient Algorithms
via Batch Prioritized Experience Replay
- Authors: Dogan C. Cicek, Enes Duran, Baturay Saglam, Furkan B. Mutlu, Suleyman
S. Kozat
- Abstract summary: We develop a novel algorithm, Batch Prioritizing Experience Replay via KL Divergence, which prioritizes batches of transitions.
We combine our algorithm with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic Policy Gradient and evaluate it on various continuous control tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The experience replay mechanism allows agents to reuse their
experiences multiple times. In prior works, the sampling probability of each
transition was adjusted according to its importance. Reassigning sampling
probabilities for every transition in the replay buffer after each iteration
is highly inefficient, so experience replay prioritization algorithms
recalculate the significance of a transition only when that transition is
sampled. However, the importance of a transition changes dynamically as the
agent's policy and value function are updated. In addition, the replay buffer
stores transitions generated by the agent's previous policies, which may
deviate significantly from its most recent policy. Larger deviation from the
most recent policy leads to more off-policy updates, which is detrimental to
the agent. In this paper, we develop a novel algorithm, Batch Prioritizing
Experience Replay via KL Divergence (KLPER), which prioritizes batches of
transitions rather than individual transitions. Moreover, to reduce the
off-policyness of the updates, our algorithm selects one batch among a number
of candidate batches and forces the agent to learn from the batch most likely
to have been generated by its most recent policy. We combine our algorithm
with Deep Deterministic Policy Gradient and Twin Delayed Deep Deterministic
Policy Gradient and evaluate it on various continuous control tasks. KLPER
provides promising improvements for deep deterministic continuous control
algorithms in terms of sample efficiency, final performance, and stability of
the policy during training.
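
The batch-selection step described in the abstract can be sketched in a few
lines of Python. The snippet below is a minimal illustration under stated
assumptions, not the authors' reference implementation: it assumes a
deterministic policy callable policy(states), a replay_buffer.sample(batch_size)
helper returning NumPy arrays, Gaussian exploration noise with standard
deviation noise_std, and that a candidate batch's off-policyness is scored by
the KL divergence between a diagonal Gaussian fitted to the deviations of the
stored actions from the current policy's actions and the exploration-noise
distribution; the lowest-scoring batch is handed to the DDPG/TD3 update.

    import numpy as np

    def kl_diag_gaussians(mu0, var0, mu1, var1, eps=1e-8):
        """KL( N(mu0, var0) || N(mu1, var1) ) for diagonal Gaussians, summed over dims."""
        var0, var1 = var0 + eps, var1 + eps
        return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

    def select_batch_klper(replay_buffer, policy, num_candidates=10,
                           batch_size=256, noise_std=0.1):
        """Sample several candidate batches and return the one whose stored
        actions deviate from the current deterministic policy in a way closest
        (in KL) to the exploration-noise distribution N(0, noise_std^2)."""
        best_batch, best_kl = None, np.inf
        for _ in range(num_candidates):
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
            # Deviation of the stored (behavior) actions from the current policy's actions.
            deviations = actions - policy(states)      # shape: (batch_size, action_dim)
            mu = deviations.mean(axis=0)
            var = deviations.var(axis=0)
            # Score the batch by KL between the fitted deviation Gaussian and
            # the agent's exploration-noise Gaussian.
            kl = kl_diag_gaussians(mu, var,
                                   np.zeros_like(mu),
                                   np.full_like(var, noise_std ** 2))
            if kl < best_kl:
                best_kl, best_batch = kl, (states, actions, rewards, next_states, dones)
        return best_batch

In a training loop, such a select_batch_klper call would stand in for the
usual uniform replay_buffer.sample call before each critic and actor update.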
Related papers
- CUER: Corrected Uniform Experience Replay for Off-Policy Continuous Deep Reinforcement Learning Algorithms [5.331052581441265]
We develop a novel algorithm, Corrected Uniform Experience Replay (CUER), which samples the stored experience while considering fairness among all other experiences.
CUER provides promising improvements for off-policy continuous control algorithms in terms of sample efficiency, final performance, and stability of the policy during the training.
arXiv Detail & Related papers (2024-06-13T12:03:40Z) - MAC-PO: Multi-Agent Experience Replay via Collective Priority
Optimization [12.473095790918347]
We propose MAC-PO, which formulates optimal prioritized experience replay for multi-agent problems.
By minimizing the resulting policy regret, we can narrow the gap between the current policy and a nominal optimal policy.
arXiv Detail & Related papers (2023-02-21T03:11:21Z) - Improving the Efficiency of Off-Policy Reinforcement Learning by
Accounting for Past Decisions [20.531576904743282]
Off-policy estimation bias is corrected in a per-decision manner.
Off-policy algorithms such as Tree Backup and Retrace rely on this mechanism.
We propose a multistep operator that permits arbitrary past-dependent traces.
arXiv Detail & Related papers (2021-12-23T00:07:28Z) - Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement
Learning with Actor Rectification [74.10976684469435]
Offline reinforcement learning (RL) algorithms can, in principle, be transferred directly to multi-agent settings.
We propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), to tackle this critical challenge.
OMAR significantly outperforms strong baselines with state-of-the-art performance in multi-agent continuous control benchmarks.
arXiv Detail & Related papers (2021-11-22T13:27:42Z) - Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples with
On-Policy Experience [9.06635747612495]
Soft Actor-Critic (SAC) is an off-policy actor-critic reinforcement learning algorithm.
SAC trains a policy by maximizing the trade-off between expected return and entropy.
It has achieved state-of-the-art performance on a range of continuous-control benchmark tasks.
arXiv Detail & Related papers (2021-09-24T06:46:28Z) - Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision
Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to Bellman optimality and evaluation back-up to take a more conservative update can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - DDPG++: Striving for Simplicity in Continuous-control Off-Policy
Reinforcement Learning [95.60782037764928]
We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
arXiv Detail & Related papers (2020-06-26T20:21:12Z) - Optimizing for the Future in Non-Stationary MDPs [52.373873622008944]
We present a policy gradient algorithm that maximizes a forecast of future performance.
We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques.
arXiv Detail & Related papers (2020-05-17T03:41:19Z) - Multiagent Value Iteration Algorithms in Dynamic Programming and
Reinforcement Learning [0.0]
We consider infinite horizon dynamic programming problems, where the control at each stage consists of several distinct decisions.
In an earlier work we introduced a policy iteration algorithm, where the policy improvement is done one-agent-at-a-time in a given order.
arXiv Detail & Related papers (2020-05-04T16:34:24Z)