The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning
- URL: http://arxiv.org/abs/2006.12692v1
- Date: Tue, 23 Jun 2020 01:35:54 GMT
- Title: The Effect of Multi-step Methods on Overestimation in Deep Reinforcement
Learning
- Authors: Lingheng Meng, Rob Gorbet, Dana Kulić
- Abstract summary: Multi-step (also called n-step) methods in reinforcement learning have been shown to be more efficient than the 1-step method.
We show that both MDDPG and MMDDPG are significantly less affected by the overestimation problem than DDPG with 1-step backup.
We also discuss the advantages and disadvantages of different ways to do multi-step expansion in order to reduce approximation error.
- Score: 6.181642248900806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step (also called n-step) methods in reinforcement learning (RL) have
been shown to be more efficient than the 1-step method due to faster
propagation of the reward signal, both theoretically and empirically, in tasks
that use a tabular representation of the value function. Recently, research in
Deep Reinforcement Learning (DRL) also shows that multi-step methods improve
learning speed and final performance in applications where the value-function
and policy are represented with deep neural networks. However, there is a lack
of understanding about what is actually contributing to the boost of
performance. In this work, we analyze the effect of multi-step methods on
alleviating the overestimation problem in DRL, where multi-step experiences are
sampled from a replay buffer. Specifically building on top of Deep
Deterministic Policy Gradient (DDPG), we propose Multi-step DDPG (MDDPG), where
different step sizes are manually set, and its variant called Mixed Multi-step
DDPG (MMDDPG) where an average over different multi-step backups is used as
update target of Q-value function. Empirically, we show that both MDDPG and
MMDDPG are significantly less affected by the overestimation problem than DDPG
with 1-step backup, which consequently results in better final performance and
learning speed. We also discuss the advantages and disadvantages of different
ways to do multi-step expansion in order to reduce approximation error, and
expose the tradeoff between overestimation and underestimation that underlies
offline multi-step methods. Finally, we compare the computational resource
needs of Twin Delayed Deep Deterministic Policy Gradient (TD3), a state-of-the-art
algorithm proposed to address overestimation in actor-critic methods, and our
proposed methods, since they show comparable final performance and learning
speed.
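For concreteness, below is a minimal sketch (Python/NumPy) of the two backup targets the abstract refers to: the n-step target used by MDDPG and the averaged "mixed" target used by MMDDPG. The function names, the replay-buffer layout, and the omission of terminal-state masking are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q, gamma, n):
    """n-step backup target:
        y = sum_{i=0}^{n-1} gamma^i * r_{t+i} + gamma^n * Q'(s_{t+n}, mu'(s_{t+n}))
    `rewards` holds r_t, ..., r_{t+n-1} from a sampled multi-step transition;
    `bootstrap_q` is the target-network estimate at step t+n.
    Terminal-state masking is omitted for brevity.
    """
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return discounted_rewards + gamma ** n * bootstrap_q

def mixed_multi_step_target(rewards, bootstrap_qs, gamma, max_n):
    """MMDDPG-style target as described in the abstract: the average of the
    1-step through max_n-step targets. `bootstrap_qs[n-1]` is the bootstrap
    value Q'(s_{t+n}, mu'(s_{t+n})) for horizon n.
    """
    targets = [n_step_target(rewards, bootstrap_qs[n - 1], gamma, n)
               for n in range(1, max_n + 1)]
    return float(np.mean(targets))

# Example: a 3-step experience with gamma = 0.99 (illustrative numbers).
y3 = n_step_target([1.0, 0.5, 0.2], bootstrap_q=10.0, gamma=0.99, n=3)
y_mix = mixed_multi_step_target([1.0, 0.5, 0.2], [9.0, 9.5, 10.0], 0.99, 3)
```

Averaging over several horizons trades the lower bias of short backups against the faster reward propagation, and, as the paper argues, reduced overestimation, of longer ones.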
Related papers
- One Step at a Time: Pros and Cons of Multi-Step Meta-Gradient
Reinforcement Learning [61.662504399411695]
We introduce a novel method that mixes multiple inner steps and enjoys a more accurate and robust meta-gradient signal.
When applied to the Snake game, the mixing meta-gradient algorithm can cut the variance by a factor of 3 while achieving similar or higher performance.
arXiv Detail & Related papers (2021-10-30T08:36:52Z) - Multi-Task Meta-Learning Modification with Stochastic Approximation [0.7734726150561089]
Few-shot learning is one of the main benchmarks for meta-learning algorithms.
In this paper we investigate a modification of the standard meta-learning pipeline that takes a multi-task approach during training.
The proposed method simultaneously uses information from several meta-training tasks in a weighted common loss function.
Proper optimization of these task weights can strongly influence training of the entire model and may improve quality on test-time tasks.
arXiv Detail & Related papers (2021-10-25T18:11:49Z) - Learning to Perform Downlink Channel Estimation in Massive MIMO Systems [72.76968022465469]
We study downlink (DL) channel estimation in a Massive multiple-input multiple-output (MIMO) system.
A common approach is to use the mean value as the estimate, motivated by channel hardening.
We propose two novel estimation methods.
arXiv Detail & Related papers (2021-09-06T13:42:32Z) - Settling the Variance of Multi-Agent Policy Gradients [14.558011059649543]
Policy gradient (PG) methods are popular reinforcement learning (RL) methods.
In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of the gradient estimates increases rapidly with the number of agents.
We offer a rigorous analysis of MAPG methods by quantifying the contributions of the number of agents and the agents' exploration to the variance of MAPG estimators.
We propose a surrogate version of the optimal baseline (OB), which can be seamlessly plugged into any existing PG method in MARL.
arXiv Detail & Related papers (2021-08-19T10:49:10Z) - Semi-On-Policy Training for Sample Efficient Multi-Agent Policy
Gradients [51.749831824106046]
We introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods.
We show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
arXiv Detail & Related papers (2021-04-27T19:37:01Z) - Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement
Learning [7.020079427649125]
We show that learning distinguishable skills for tasks with non-unique optima can be essential for further improving learning efficiency and performance.
We propose a probabilistic mixture-of-experts (PMOE) for multimodal policies, together with a novel gradient estimator for the resulting non-differentiability problem.
arXiv Detail & Related papers (2021-04-19T08:21:56Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm, ZO-RL, that learns a sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Queueing Network Controls via Deep Reinforcement Learning [0.0]
We develop a Proximal Policy Optimization (PPO) algorithm for queueing networks.
The algorithm consistently generates control policies that outperform the state of the art in the literature.
A key to the successes of our PPO algorithm is the use of three variance reduction techniques in estimating the relative value function.
arXiv Detail & Related papers (2020-07-31T01:02:57Z) - Off-Policy Multi-Agent Decomposed Policy Gradients [30.389041305278045]
We investigate the causes that hinder the performance of multi-agent policy gradient (MAPG) algorithms and present a multi-agent decomposed policy gradient method (DOP).
DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment.
In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms.
arXiv Detail & Related papers (2020-07-24T02:21:55Z) - Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best-known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z) - Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods; a rough sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
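As a rough illustration of the last item above, one way to read "exploits the estimated value function Q globally" is: sample several candidate actions per state, keep the one the critic scores highest, and regress the policy toward it with a supervised loss. The sketch below is an assumption-laden paraphrase in PyTorch, not the ZOSPI authors' code; `policy`, `q_net`, and the uniform action sampling in [-1, 1] are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def supervised_policy_improvement_step(policy, q_net, optimizer, states,
                                       num_candidates=16):
    """One hypothetical update: pick the best of several sampled actions
    under the learned critic, then fit the policy to it by regression."""
    batch_size = states.shape[0]
    act_dim = policy(states).shape[-1]
    with torch.no_grad():
        # Uniformly sample candidate actions in [-1, 1]^act_dim (assumed bounds).
        candidates = torch.rand(batch_size, num_candidates, act_dim) * 2.0 - 1.0
        # Score every (state, candidate) pair with the critic.
        states_rep = states.unsqueeze(1).expand(-1, num_candidates, -1)
        scores = q_net(states_rep.reshape(-1, states.shape[-1]),
                       candidates.reshape(-1, act_dim))
        best_idx = scores.reshape(batch_size, num_candidates).argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]
    # Supervised regression of the policy toward the highest-scoring actions.
    loss = F.mse_loss(policy(states), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design choice this sketch illustrates is replacing the usual deterministic policy gradient through Q with a global search over sampled actions followed by a plain regression step.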