The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning
- URL: http://arxiv.org/abs/2006.12692v1
- Date: Tue, 23 Jun 2020 01:35:54 GMT
- Title: The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning
- Authors: Lingheng Meng, Rob Gorbet, Dana Kulić
- Abstract summary: Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error.
- Score: 6.181642248900806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step (also called n-step) methods in reinforcement learning (RL) have
been shown to be more efficient than the 1-step method due to faster
propagation of the reward signal, both theoretically and empirically, in tasks
that use a tabular representation of the value function. Recently, research in
Deep Reinforcement Learning (DRL) also shows that multi-step methods improve
learning speed and final performance in applications where the value-function
and policy are represented with deep neural networks. However, there is a lack
of understanding about what is actually contributing to the boost of
performance. In this work, we analyze the effect of multi-step methods on
alleviating the overestimation problem in DRL, where multi-step experiences are
sampled from a replay buffer. Specifically building on top of Deep
Deterministic Policy Gradient (DDPG), we propose Multi-step DDPG (MDDPG), where
different step sizes are manually set, and its variant called Mixed Multi-step
DDPG (MMDDPG) where an average over different multi-step backups is used as
update target of Q-value function. Empirically, we show that both MDDPG and
MMDDPG are significantly less affected by the overestimation problem than DDPG
with 1-step backup, which consequently results in better final performance and
learning speed. We also discuss the advantages and disadvantages of different
ways to do multi-step expansion in order to reduce approximation error, and
expose the tradeoff between overestimation and underestimation that underlies
offline multi-step methods. Finally, we compare the computational resource
needs of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art
algorithm proposed to address overestimation in actor-critic methods, and our
proposed methods, since they show comparable final performance and learning
speed.
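For concreteness, below is a minimal sketch (Python/NumPy) of the two backup targets the abstract refers to: the n-step target used by MDDPG and the averaged "mixed" target used by MMDDPG. The function names, the replay-buffer layout, and the omission of terminal-state masking are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q, gamma, n):
    """n-step backup target:
        y = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * Q'(s_{t+n}, mu'(s_{t+n}))
    `rewards` holds r_t, ..., r_{t+n-1} from a sampled multi-step transition;
    `bootstrap_q` is the target-network estimate at step t+n.
    Terminal-state masking is omitted for brevity.
    """
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return discounted_rewards + gamma ** n * bootstrap_q

def mixed_multi_step_target(rewards, bootstrap_qs, gamma, max_n):
    """MMDDPG-style target as described in the abstract: the average of the
    1-step through max_n-step targets. `bootstrap_qs[n-1]` is the bootstrap
    value Q'(s_{t+n}, mu'(s_{t+n})) for horizon n.
    """
    targets = [n_step_target(rewards, bootstrap_qs[n - 1], gamma, n)
               for n in range(1, max_n + 1)]
    return float(np.mean(targets))

# Example: a 3-step experience with gamma = 0.99 (illustrative numbers).
y3 = n_step_target([1.0, 0.5, 0.2], bootstrap_q=10.0, gamma=0.99, n=3)
y_mix = mixed_multi_step_target([1.0, 0.5, 0.2], [9.0, 9.5, 10.0], 0.99, 3)
```

Averaging over several horizons trades the lower bias of short backups against the faster reward propagation, and, as the paper argues, reduced overestimation, of longer ones.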
Related papers
- One Step at a Time: Pros and Cons of Multi-Step Meta-Gradient
Reinforcement Learning [61.662504399411695]
We introduce a novel method that mixes multiple inner steps and enjoys a more accurate and robust meta-gradient signal.
When applied to the Snake game, the mixing meta-gradient algorithm can cut the variance by a factor of 3 while achieving similar or higher performance.
arXiv Detail & Related papers (2021-10-30T08:36:52Z) - Multi-Task Meta-Learning Modification with Stochastic Approximation [0.7734726150561089]
Few-shot learning is one of the main benchmarks for meta-learning algorithms.
In this paper we investigate a modification of the standard meta-learning pipeline that takes a multi-task approach during training.
The proposed method simultaneously uses information from several meta-training tasks in a weighted common loss function.
Proper optimization of these task weights can strongly influence training of the entire model and may improve quality on test-time tasks.
arXiv Detail & Related papers (2021-10-25T18:11:49Z) - Learning to Perform Downlink Channel Estimation in Massive MIMO Systems [72.76968022465469]
We study downlink (DL) channel estimation in a Massive multiple-input multiple-output (MIMO) system.
A common approach is to use the mean value as the estimate, motivated by channel hardening.
We propose two novel estimation methods.
arXiv Detail & Related papers (2021-09-06T13:42:32Z) - Settling the Variance of Multi-Agent Policy Gradients [14.558011059649543]
Policy gradient (PG) methods are popular reinforcement learning (RL) methods.
In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of the gradient estimates increases rapidly with the number of agents.
We offer a rigorous analysis of MAPG methods by quantifying the contributions of the number of agents and the agents' exploration to the variance of MAPG estimators.
We propose a surrogate version of the optimal baseline (OB), which can be seamlessly plugged into any existing PG method in MARL.
arXiv Detail & Related papers (2021-08-19T10:49:10Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy
Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement
Learning [7.020079427649125]
We show that learning distinguishable skills for tasks with non-unique optima can be essential for further improving learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policies, together with a novel gradient estimator for the resulting non-differentiability problem.
arXiv Detail & Related papers (2021-04-19T08:21:56Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, that learns a sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z) - Off-Policy Multi-Agent Decomposed Policy Gradients [30.389041305278045]
We investigate the causes that hinder the performance of multi-agent policy gradient (MAPG) algorithms and present a multi-agent decomposed policy gradient method (DOP).
DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment.
In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
arXiv Detail & Related papers (2020-07-24T02:21:55Z) - Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best-known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods; a rough sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
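As a rough illustration of the last item above, one way to read "exploits the estimated value function Q globally" is: sample several candidate actions per state, keep the one the critic scores highest, and regress the policy toward it with a supervised loss. The sketch below is an assumption-laden paraphrase in PyTorch, not the ZOSPI authors' code; `policy`, `q_net`, and the uniform action sampling in [-1, 1] are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def supervised_policy_improvement_step(policy, q_net, optimizer, states,
                                       num_candidates=16):
    """One hypothetical update: pick the best of several sampled actions
    under the learned critic, then fit the policy to it by regression."""
    batch_size = states.shape[0]
    act_dim = policy(states).shape[-1]
    with torch.no_grad():
        # Uniformly sample candidate actions in [-1, 1]^act_dim (assumed bounds).
        candidates = torch.rand(batch_size, num_candidates, act_dim) * 2.0 - 1.0
        # Score every (state, candidate) pair with the critic.
        states_rep = states.unsqueeze(1).expand(-1, num_candidates, -1)
        scores = q_net(states_rep.reshape(-1, states.shape[-1]),
                       candidates.reshape(-1, act_dim))
        best_idx = scores.reshape(batch_size, num_candidates).argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]
    # Supervised regression of the policy toward the highest-scoring actions.
    loss = F.mse_loss(policy(states), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice this sketch illustrates is replacing the usual deterministic policy gradient through Q with a global search over sampled actions followed by a plain regression step.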