Variance Reduction based Experience Replay for Policy Optimization
- URL: http://arxiv.org/abs/2208.12341v1
- Date: Thu, 25 Aug 2022 20:51:00 GMT
- Title: Variance Reduction based Experience Replay for Policy Optimization
- Authors: Hua Zheng, Wei Xie, M. Ben Feng
- Abstract summary: We propose a general variance reduction based experience replay (VRER) framework that can selectively reuse the most relevant samples to improve policy gradient estimation.
Our theoretical and empirical studies show that the proposed VRER can accelerate the learning of optimal policy and enhance the performance of state-of-the-art policy optimization approaches.
- Score: 3.0657293044976894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For reinforcement learning on complex stochastic systems where many factors
dynamically impact the output trajectories, it is desirable to effectively
leverage the information from historical samples collected in previous
iterations to accelerate policy optimization. Classical experience replay
allows agents to remember by reusing historical observations. However, the
uniform reuse strategy that treats all observations equally overlooks the
relative importance of different samples. To overcome this limitation, we
propose a general variance reduction based experience replay (VRER) framework
that can selectively reuse the most relevant samples to improve policy gradient
estimation. This selective mechanism can adaptively put more weight on past
samples that are more likely to be generated by the current target
distribution. Our theoretical and empirical studies show that the proposed VRER
can accelerate the learning of optimal policy and enhance the performance of
state-of-the-art policy optimization approaches.
Related papers
- How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics [65.67654005892469]
We show that proper instance-dependent sampling can yield stronger ranking guarantees, while skewed on-policy sampling can induce excessive concentration under structured preferences.<n>We then analyze iterative alignment dynamics in which the learned policy feeds back into future sampling and reference policies.<n>Our theoretical insights extend to Direct Preference Optimization, indicating the phenomena we captured are common to a broader class of preference-alignment methods.
arXiv Detail & Related papers (2026-02-12T17:11:08Z) - Variance Reduction Based Experience Replay for Policy Optimization [3.7128732378843394]
Variance Reduction Experience Replay (VRER) is a principled framework that selectively reuses informative samples to reduce variance in policy gradient estimation.<n>VRER is algorithm-agnostic and integrates seamlessly with existing policy optimization methods.<n>We show that VRER consistently accelerates policy learning and improves performance over state-of-the-art policy optimization algorithms.
arXiv Detail & Related papers (2026-02-05T06:58:28Z) - Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning [52.97053840476386]
We show that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates.<n>We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved.
arXiv Detail & Related papers (2025-11-13T23:06:40Z) - RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization [65.23034604711489]
We introduce RLoop, a self-improving framework for training large reasoning models.<n>RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset.<n>Our experiments show RLoops forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
arXiv Detail & Related papers (2025-11-06T11:27:16Z) - Understanding the Impact of Sampling Quality in Direct Preference Optimization [4.122673728216191]
We study how data of higher quality can be leveraged to improve performance in Direct Preference Optimization (DPO)<n>Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution.
arXiv Detail & Related papers (2025-06-03T18:12:40Z) - Reusing Historical Trajectories in Natural Policy Gradient via
Importance Sampling: Convergence and Convergence Rate [8.943964058164257]
We study a variant of the natural policy reusing historical trajectories via importance gradient sampling.
We show that the bias of the proposed estimator of gradient sampling is gradientally negligible, the resultant algorithm is convergent, and reusing past trajectories helps improve the convergence rate.
arXiv Detail & Related papers (2024-03-01T17:08:30Z) - Truncating Trajectories in Monte Carlo Reinforcement Learning [48.97155920826079]
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cumulative discounted sum of an external reward signal.
We propose an a-priori budget allocation strategy that leads to the collection of trajectories of different lengths.
We show that an appropriate truncation of the trajectories can succeed in improving performance.
arXiv Detail & Related papers (2023-05-07T19:41:57Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose a Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Variance Reduction based Partial Trajectory Reuse to Accelerate Policy
Gradient Optimization [3.621753051212441]
We extend the idea of green simulation assisted policy gradient (GS-PG) to partial historical trajectory reuse for Markov Decision Processes (MDP)
In this paper, the mixture likelihood ratio (MLR) based policy gradient estimation is used to leverage the information from historical state decision transitions generated under different behavioral policies.
arXiv Detail & Related papers (2022-05-06T01:42:28Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z) - Replay For Safety [51.11953997546418]
In experience replay, past transitions are stored in a memory buffer and re-used during learning.
We show that using an appropriate biased sampling scheme can allow us to achieve a emphsafe policy.
arXiv Detail & Related papers (2021-12-08T11:10:57Z) - Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
arXiv Detail & Related papers (2021-10-17T19:28:45Z) - Experience Replay with Likelihood-free Importance Weights [123.52005591531194]
We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically on two competitive methods, Soft Actor Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3)
arXiv Detail & Related papers (2020-06-23T17:17:44Z) - Adaptive Experience Selection for Policy Gradient [8.37609145576126]
Experience replay is a commonly used approach to improve sample efficiency.
gradient estimators using past trajectories typically have high variance.
Existing sampling strategies for experience replay like uniform sampling or prioritised experience replay do not explicitly try to control the variance of the gradient estimates.
We propose an online learning algorithm, adaptive experience selection (AES), to adaptively learn an experience sampling distribution that explicitly minimises this variance.
arXiv Detail & Related papers (2020-02-17T13:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.