Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning
- URL: http://arxiv.org/abs/2308.14897v1
- Date: Mon, 28 Aug 2023 20:46:07 GMT
- Title: Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning
- Authors: Hanhan Zhou, Tian Lan, Vaneet Aggarwal
- Abstract summary: We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
- Score: 53.97273491846883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning aims to utilize datasets of previously
gathered environment-action interaction records to learn a policy without
access to the real environment. Recent work has shown that offline
reinforcement learning can be formulated as a sequence modeling problem and
solved via supervised learning with approaches such as decision transformer.
While these sequence-based methods achieve competitive results over
return-to-go methods, especially on tasks that require longer episodes or with
scarce rewards, importance sampling is not considered to correct the policy
bias when dealing with off-policy data, mainly due to the absence of behavior
policy and the use of deterministic evaluation policies. To this end, we
propose DPE: an RL algorithm that blends offline sequence modeling and offline
reinforcement learning with Double Policy Estimation (DPE) in a unified
framework with statistically proven properties on variance reduction. We
validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks. Our
method brings performance improvements on selected methods, which outperform
SOTA baselines in several tasks, demonstrating the advantages of enabling
double policy estimation for sequence-modeled reinforcement learning.
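The importance-sampling correction mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's DPE algorithm: it is ordinary importance sampling in a toy one-step setting with discrete states and actions, where the missing behavior policy is replaced by a count-based estimate from the logged data. All names (estimate_policy, importance_sampling_value) and the toy policies are hypothetical.

```python
import numpy as np

def estimate_policy(states, actions, n_states, n_actions, smoothing=1.0):
    """Smoothed count-based estimate of pi(a | s) from logged data (illustrative)."""
    counts = np.full((n_states, n_actions), smoothing, dtype=float)
    np.add.at(counts, (states, actions), 1.0)  # accumulate visits per (state, action)
    return counts / counts.sum(axis=1, keepdims=True)

def importance_sampling_value(states, actions, rewards, target_policy, behavior_policy):
    """Ordinary importance-sampling estimate of the target policy's expected reward."""
    ratios = target_policy[states, actions] / behavior_policy[states, actions]
    return float(np.mean(ratios * rewards))

# Toy logged dataset: the true behavior policy is used only to generate data.
rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 3, 2, 5000
true_behavior = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])
target = np.array([[0.1, 0.9], [0.1, 0.9], [0.1, 0.9]])

states = rng.integers(0, n_states, size=n_samples)
actions = (rng.random(n_samples) < true_behavior[states, 1]).astype(int)
rewards = actions.astype(float)  # reward 1 for action 1, otherwise 0

# The behavior policy is unknown offline, so estimate it from the same data.
behavior_hat = estimate_policy(states, actions, n_states, n_actions)
value_hat = importance_sampling_value(states, actions, rewards, target, behavior_hat)
print(value_hat)  # the target policy's true expected reward in this toy setup is 0.9
```

In this sketch the reweighted average approximates the return the target policy would obtain, even though the data were collected under a different policy.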
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Model-based trajectory stitching for improved behavioural cloning and
its applications [7.462336024223669]
Trajectory Stitching (TS) generates new trajectories by 'stitching' pairs of states that were disconnected in the original data.
We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy.
arXiv Detail & Related papers (2022-12-08T14:18:04Z)
- Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z)
- Offline Reinforcement Learning via High-Fidelity Generative Behavior
Modeling [34.88897402357158]
We show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training.
We adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model.
Our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2022-09-29T04:36:23Z)
- Mitigating Off-Policy Bias in Actor-Critic Methods with One-Step
Q-learning: A Novel Correction Approach [0.0]
We introduce a novel policy similarity measure to mitigate the effects of such discrepancy in continuous control.
Our method offers an adequate single-step off-policy correction that is applicable to deterministic policy networks.
arXiv Detail & Related papers (2022-08-01T11:33:12Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves on the average performance of the next best-performing offline reinforcement learning method by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline
Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)