Related papers: Variance-Reduced Off-Policy Memory-Efficient Policy Search

URL: http://arxiv.org/abs/2009.06548v1
Date: Mon, 14 Sep 2020 16:22:46 GMT
Title: Variance-Reduced Off-Policy Memory-Efficient Policy Search
Authors: Daoming Lyu, Qi Qi, Mohammad Ghavamzadeh, Hengshuai Yao, Tianbao Yang, Bo Liu
Abstract summary: Off-policy policy optimization is a challenging problem in reinforcement learning. The paper proposes an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples.
Score: 61.23789485979057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Off-policy policy optimization is a challenging problem in reinforcement learning (RL). The algorithms designed for this problem often suffer from high variance in their estimators, which results in poor sample efficiency, and have issues with convergence. A few variance-reduced on-policy policy gradient algorithms have been recently proposed that use methods from stochastic optimization to reduce the variance of the gradient estimate in the REINFORCE algorithm. However, these algorithms are not designed for the off-policy setting and are memory-inefficient, since they need to collect and store a large "reference" batch of samples from time to time. To achieve variance-reduced off-policy-stable policy optimization, we propose an algorithm family that is memory-efficient, stochastically variance-reduced, and capable of learning from off-policy samples. Empirical studies validate the effectiveness of the proposed approaches.

Related papers

Policy Testing in Markov Decision Processes [48.642181362172906]
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a numerical threshold.
arXiv Detail & Related papers (2025-05-21T10:13:54Z)
An Efficient On-Policy Deep Learning Framework for Stochastic Optimal Control [14.832859803172846]
We present a novel on-policy algorithm for solving stochastic optimal control (SOC) problems. By leveraging the Girsanov theorem, our method directly computes on-policy gradients of the SOC objective without expensive backpropagation through differential equations or adjoint problem solutions. Experimental results demonstrate substantial improvements in both computational speed and memory efficiency compared to existing approaches.
arXiv Detail & Related papers (2024-10-07T16:16:53Z)
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), which enables the effective reuse of previously collected samples. However, IS is employed in RL as a passive tool for re-weighting historical samples. We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation. We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}\left(\frac{\text{poly}(d)}{\varepsilon^3}\right)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
Distillation Policy Optimization [5.439020425819001]
We introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control. This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline. Our results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the off-policy approaches.
arXiv Detail & Related papers (2023-02-01T15:59:57Z)
Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward. Several algorithms rooted in sample-based primal-dual methods have been recently proposed to solve this problem in policy space. We propose a novel algorithm for constrained RL that does not suffer from these limitations.
arXiv Detail & Related papers (2022-12-03T01:54:55Z)
A Policy Efficient Reduction Approach to Convex Constrained Deep Reinforcement Learning [2.811714058940267]
We propose a new variant of the conditional gradient (CG) type algorithm, which generalizes the minimum norm point (MNP) method. Our method reduces the memory costs by an order of magnitude, and achieves better performance, demonstrating both its effectiveness and efficiency.
arXiv Detail & Related papers (2021-08-29T20:51:32Z)
Offline RL Without Off-Policy Evaluation [49.11859771578969]
We show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark.
arXiv Detail & Related papers (2021-06-16T16:04:26Z)
Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order (ZO) algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling. Our results show that our ZO-RL algorithm can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs. Bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad. We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. We build on the general sample efficiency of off-policy algorithms. We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.