Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning
- URL: http://arxiv.org/abs/2210.07312v1
- Date: Thu, 13 Oct 2022 19:30:43 GMT
- Title: Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning
- Authors: Md Masudur Rahman, Yexiang Xue
- Abstract summary: This paper proposes an advantage estimation approach based on data augmentation for policy optimization.
Our method uses data augmentation to compute a bootstrap advantage estimate.
We observe that our method reduces the policy and value losses more effectively than Generalized Advantage Estimation (GAE).
- Score: 16.999444076456268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an advantage estimation approach based on data
augmentation for policy optimization. Unlike existing methods, which apply
data augmentation to the input when learning the value and policy functions,
our method uses data augmentation to compute a bootstrap advantage estimate.
This Bootstrap Advantage Estimation (BAE) is then used in the gradient
updates of the policy and the value function. To demonstrate the
effectiveness of our approach, we conducted experiments on several
environments drawn from three benchmarks: Procgen, DeepMind Control, and
PyBullet, which together cover image- and vector-based observations as well
as discrete and continuous action spaces. We observe that our method reduces
the policy and value losses more effectively than Generalized Advantage
Estimation (GAE) and ultimately improves cumulative return. Furthermore, our
method outperforms two recently proposed data augmentation techniques, RAD
and DrAC. Overall, our method empirically outperforms the baselines in both
sample efficiency and generalization, where the agent is tested in unseen
environments.
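The abstract does not spell out the estimator itself, so the following is only a minimal sketch of one plausible reading: average the value bootstrap over several augmented views of each observation before forming the temporal-difference advantage, with standard GAE included for comparison. `value_fn`, `augment`, and the other names are assumed interfaces, not the paper's API.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Standard Generalized Advantage Estimation (GAE), the baseline above."""
    advantages = np.zeros(len(rewards))
    acc = 0.0
    next_value = last_value  # bootstrap value of the state after the rollout
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        acc = delta + gamma * lam * acc  # exponentially weighted sum of TD errors
        advantages[t] = acc
        next_value = values[t]
    return advantages

def bootstrap_advantage(rewards, obs, next_obs, value_fn, augment,
                        n_aug=4, gamma=0.99):
    """Hypothetical BAE-style estimate: average the value bootstrap over
    several augmented views of each observation before taking the TD error.
    Episode-termination masking is omitted for brevity."""
    v = np.mean([value_fn(augment(obs)) for _ in range(n_aug)], axis=0)
    v_next = np.mean([value_fn(augment(next_obs)) for _ in range(n_aug)], axis=0)
    return rewards + gamma * v_next - v
```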
Related papers
- Doubly Optimal Policy Evaluation for Reinforcement Learning [16.7091722884524]
Policy evaluation often suffers from large variance and requires massive data to achieve desired accuracy.
In this work, we design an optimal combination of data-collecting policy and data-processing baseline.
Theoretically, we prove that our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than the previously best-performing methods.
arXiv Detail & Related papers (2024-10-03T05:47:55Z)
$\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring (IPS) estimator (a generic paired variant is sketched below).
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
arXiv Detail & Related papers (2024-05-16T12:04:55Z)
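The entry above does not give the paper's estimators; the sketch below is only a generic paired IPS estimate of a value difference, illustrating the underlying idea that re-weighting both target policies against the same logged actions lets shared noise cancel. All names are illustrative.

```python
import numpy as np

def delta_ips(rewards, logging_probs, probs_a, probs_b):
    """Generic paired IPS estimate of V(pi_a) - V(pi_b) from logged bandit data.

    rewards:       observed rewards r_i for the logged actions
    logging_probs: propensities pi_0(a_i | x_i) of the logging policy
    probs_a/b:     pi_a(a_i | x_i) and pi_b(a_i | x_i) under the two target policies
    """
    # Differencing the two IPS weights lets correlated reward noise cancel.
    weights = (probs_a - probs_b) / logging_probs
    return float(np.mean(weights * rewards))
```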
Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is typically employed in RL as a passive tool for re-weighting historical samples.
We instead look for the best behavioral policy from which to collect samples so as to reduce the variance of the policy gradient (the passive estimator it improves on is sketched below).
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
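For reference, here is the "passive" estimator the entry refers to: a minimal importance-sampled REINFORCE-style gradient in which the behavior policy is taken as given. Choosing that behavior policy to minimize the variance of this estimate is the paper's contribution and is not shown; names and shapes are assumptions.

```python
import numpy as np

def is_policy_gradient(returns, score_fns, target_probs, behavior_probs):
    """Vanilla (passive) importance-sampled policy gradient estimate.

    returns:        trajectory returns R_i, shape (n,)
    score_fns:      grad log pi_theta per trajectory, shape (n, d)
    target_probs:   trajectory likelihoods under the target policy pi_theta
    behavior_probs: trajectory likelihoods under the behavior policy
    """
    weights = target_probs / behavior_probs  # per-trajectory IS ratios
    # An "active" scheme would pick the behavior policy to keep the
    # variance of weights * returns small; here the weights are just applied.
    return ((weights * returns)[:, None] * score_fns).mean(axis=0)
```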
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data (the DPO objective is sketched below).
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
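The DPO objective itself is standard (Rafailov et al., 2023); a minimal PyTorch sketch follows. In this paper's setting the chosen/rejected pairs would be the step-level preferences mined from the search tree, but the loss is agnostic to where they come from.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss.

    Inputs are summed log-probabilities of the preferred/dispreferred
    completions under the trainable policy and a frozen reference model.
    """
    chosen_margin = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_margin = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the log-odds that the preferred completion beats the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```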
Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of off-policy evaluation (OPE) with human preferences and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning (the clipped-IPS baseline such estimators refine is sketched below).
Experimental results on synthetic and three real-world recommendation datasets demonstrate the favorable sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
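The entry does not give UIPS's actual reweighting rule; the sketch below shows only the generic clipped-IPS baseline that such uncertainty-aware schemes refine. Names are illustrative.

```python
import numpy as np

def clipped_ips(rewards, logging_probs, target_probs, max_weight=10.0):
    """Generic clipped IPS value estimate, a standard variance-control baseline.

    Capping the ratios pi(a|x) / pi_0(a|x) trades a little bias for a large
    variance reduction when some logged propensities are very small; per the
    abstract above, UIPS instead derives its reweighting from uncertainty.
    """
    weights = np.minimum(target_probs / logging_probs, max_weight)
    return float(np.mean(weights * rewards))
```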
Value Enhancement of Reinforcement Learning via Efficient and Robust Trust Region Optimization [14.028916306297928]
Reinforcement learning (RL) is a powerful machine learning technique that enables an intelligent agent to learn an optimal policy.
We propose a novel value enhancement method to improve the performance of a given initial policy computed by existing state-of-the-art RL algorithms.
arXiv Detail & Related papers (2023-01-05T18:43:40Z)
Variance-Optimal Augmentation Logging for Counterfactual Evaluation in Contextual Bandits [25.153656462604268]
Methods for offline A/B testing and counterfactual learning are seeing rapid adoption in search and recommender systems.
The counterfactual estimators that are commonly used in these methods can have large bias and large variance when the logging policy is very different from the target policy being evaluated.
This paper introduces Minimum Variance Augmentation Logging (MVAL), a method for constructing logging policies that minimize the variance of the downstream evaluation or learning problem.
arXiv Detail & Related papers (2022-02-03T17:37:11Z)
Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods (a rough sketch follows this entry).
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
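One plausible reading of "exploiting $Q$ globally", sketched from the abstract alone: sample candidate actions, keep the one the critic scores highest, and fit the policy to it with a supervised loss. `q_fn` and `sample_actions` are assumed interfaces, not the paper's API.

```python
import numpy as np

def zospi_style_target(state, q_fn, sample_actions, n_candidates=16):
    """Pick a supervised regression target for the policy at `state`.

    Rather than differentiating through Q as PG methods do locally, sample
    actions globally and keep the one with the highest estimated value.
    """
    candidates = sample_actions(n_candidates)  # e.g. uniform over the action space
    q_values = np.array([q_fn(state, a) for a in candidates])
    return candidates[int(np.argmax(q_values))]
```

The policy would then be regressed toward these targets with a squared-error loss; how ZOSPI balances this with local exploitation is in the paper, not this sketch.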
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.