Variance Reduction based Experience Replay for Policy Optimization
- URL: http://arxiv.org/abs/2110.08902v4
- Date: Fri, 12 Apr 2024 22:13:14 GMT
- Title: Variance Reduction based Experience Replay for Policy Optimization
- Authors: Hua Zheng, Wei Xie, M. Ben Feng,
- Abstract summary: Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
- Score: 3.0790370651488983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For reinforcement learning on complex stochastic systems, it is desirable to effectively leverage the information from historical samples collected in previous iterations to accelerate policy optimization. Classical experience replay, while effective, treats all observations uniformly, neglecting their relative importance. To address this limitation, we introduce a novel Variance Reduction Experience Replay (VRER) framework, enabling the selective reuse of relevant samples to improve policy gradient estimation. VRER, as an adaptable method that can seamlessly integrate with different policy optimization algorithms, forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER (PG-VRER). Furthermore, the lack of a rigorous understanding of the experience replay approach in the literature motivates us to introduce a novel theoretical framework that accounts for sample dependencies induced by Markovian noise and behavior policy interdependencies. This framework is then employed to analyze the finite-time convergence of the proposed PG-VRER algorithm, revealing a crucial bias-variance trade-off in policy gradient estimation: the reuse of older experience tends to introduce a larger bias while simultaneously reducing gradient estimation variance. Extensive experiments have shown that VRER offers a notable and consistent acceleration in learning optimal policies and enhances the performance of state-of-the-art (SOTA) policy optimization approaches.
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Reusing Historical Trajectories in Natural Policy Gradient via
Importance Sampling: Convergence and Convergence Rate [8.943964058164257]
We study a variant of the natural policy reusing historical trajectories via importance gradient sampling.
We show that the bias of the proposed estimator of gradient sampling is gradientally negligible, the resultant algorithm is convergent, and reusing past trajectories helps improve the convergence rate.
arXiv Detail & Related papers (2024-03-01T17:08:30Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Variance Reduction based Experience Replay for Policy Optimization [3.0657293044976894]
We propose a general variance reduction based experience replay (VRER) framework that can selectively reuse the most relevant samples to improve policy gradient estimation.
Our theoretical and empirical studies show that the proposed VRER can accelerate the learning of optimal policy and enhance the performance of state-of-the-art policy optimization approaches.
arXiv Detail & Related papers (2022-08-25T20:51:00Z) - Variance Reduction based Partial Trajectory Reuse to Accelerate Policy
Gradient Optimization [3.621753051212441]
We extend the idea of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for Markov Decision Processes (MDP)
In this paper, the mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state decision transitions generated under different behavioral policies.
arXiv Detail & Related papers (2022-05-06T01:42:28Z) - Bag of Tricks for Natural Policy Gradient Reinforcement Learning [87.54231228860495]
We have implemented and compared strategies that impact performance in natural policy gradient reinforcement learning.
The proposed collection of strategies for performance optimization can improve results by 86% to 181% across the MuJuCo control benchmark.
arXiv Detail & Related papers (2022-01-22T17:44:19Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Adaptive Experience Selection for Policy Gradient [8.37609145576126]
Experience replay is a commonly used approach to improve sample efficiency.
gradient estimators using past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
arXiv Detail & Related papers (2020-02-17T13:16:37Z) - A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.