Off-Policy Evaluation for Large Action Spaces via Policy Convolution
- URL: http://arxiv.org/abs/2310.15433v1
- Date: Tue, 24 Oct 2023 01:00:01 GMT
- Title: Off-Policy Evaluation for Large Action Spaces via Policy Convolution
- Authors: Noveen Sachdeva, Lequn Wang, Dawen Liang, Nathan Kallus, Julian McAuley
- Abstract summary: The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
- Score: 60.6953713877886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing accurate off-policy estimators is crucial both for evaluating and
for optimizing new policies. The main challenge in off-policy estimation is the
distribution shift between the logging policy that generates data and the
target policy that we aim to evaluate. Typically, techniques for correcting
distribution shift involve some form of importance sampling. This approach
results in unbiased value estimation but often comes with the trade-off of high
variance, even in the simpler case of one-step contextual bandits. Furthermore,
importance sampling relies on the common support assumption, which becomes
impractical when the action space is large. To address these challenges, we
introduce the Policy Convolution (PC) family of estimators. These methods
leverage latent structure within actions -- made available through action
embeddings -- to strategically convolve the logging and target policies. This
convolution introduces a unique bias-variance trade-off, which can be
controlled by adjusting the amount of convolution. Our experiments on synthetic
and benchmark datasets demonstrate remarkable mean squared error (MSE)
improvements when using PC, especially when either the action space or policy
mismatch becomes large, with gains of up to 5 - 6 orders of magnitude over
existing estimators.
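The abstract does not spell out the estimator's exact form, but the core idea can be sketched in a few lines. The Python snippet below is a minimal, hypothetical illustration: it assumes Gaussian-kernel smoothing over the action embeddings as the convolution, with the kernel bandwidth standing in for the "amount of convolution"; the function and variable names are not from the paper.

```python
import numpy as np

def convolve_policy(probs, embeddings, bandwidth):
    """Smooth per-context action probabilities (n, A) with a Gaussian
    kernel over the action-embedding space (A, d). A larger bandwidth
    spreads probability mass across similar actions."""
    d2 = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))   # (A, A) similarity
    kernel /= kernel.sum(axis=1, keepdims=True)     # row-stochastic
    smoothed = probs @ kernel                       # convolved policy
    return smoothed / smoothed.sum(axis=1, keepdims=True)

def pc_value_estimate(actions, rewards, pi0_probs, pi_probs,
                      embeddings, bandwidth):
    """Importance-sampling value estimate computed with the convolved
    logging and target policies instead of the originals."""
    actions = np.asarray(actions)
    pi0_conv = convolve_policy(pi0_probs, embeddings, bandwidth)
    pi_conv = convolve_policy(pi_probs, embeddings, bandwidth)
    idx = np.arange(len(actions))
    weights = pi_conv[idx, actions] / pi0_conv[idx, actions]
    return float(np.mean(weights * rewards))
```

Under this reading, a larger bandwidth shares more probability mass among nearby actions, lowering variance and easing the common-support requirement at the cost of extra bias, while a bandwidth shrinking toward zero approaches ordinary inverse-propensity scoring.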
Related papers
- Statistical Analysis of Policy Space Compression Problem [54.1754937830779]
Policy search methods are crucial in reinforcement learning, offering a framework to address continuous state-action and partially observable problems.
Reducing the policy space through policy compression emerges as a powerful, reward-free approach to accelerate the learning process.
This technique condenses the policy space into a smaller, representative set while maintaining most of the original effectiveness.
arXiv Detail & Related papers (2024-11-15T02:46:55Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Off-Policy Evaluation in Markov Decision Processes under Weak Distributional Overlap [5.0401589279256065]
We revisit the task of off-policy evaluation in Markov decision processes (MDPs) under a weaker notion of distributional overlap.
We introduce a class of truncated doubly robust (TDR) estimators which we find to perform well in this setting.
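The summary does not give the TDR construction itself; as a hedged illustration only, the sketch below applies a common truncation idea (clipping the importance weights) to a one-step doubly robust estimate with a hypothetical reward model `q_hat`. It is a simpler contextual-bandit analogue, not the paper's actual MDP estimator.

```python
import numpy as np

def truncated_dr_estimate(actions, rewards, pi0_probs, pi_probs,
                          q_hat, weight_cap):
    """One-step doubly robust estimate with importance weights clipped
    at `weight_cap`; clipping bounds the variance under weak overlap
    at the price of some bias."""
    actions = np.asarray(actions)
    idx = np.arange(len(actions))
    weights = pi_probs[idx, actions] / pi0_probs[idx, actions]
    weights = np.minimum(weights, weight_cap)        # truncation step
    direct = (pi_probs * q_hat).sum(axis=1)          # model-based term
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(direct + correction))
```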
arXiv Detail & Related papers (2024-02-13T03:55:56Z)
- Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits [31.571978291138866]
We introduce a distributionally robust approach that enhances the reliability of offline policy evaluation in contextual bandits.
Our method aims to deliver robust policy evaluation results in the presence of discrepancies in both context and policy distribution.
arXiv Detail & Related papers (2024-01-21T00:42:06Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
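As background, the implicit-exploration (IX) idea replaces the logging propensity in the denominator of the importance weight with the propensity plus a small constant. The sketch below shows that modification for a one-step value estimate; it is an illustrative fragment under that reading, not the full offline-learning procedure analyzed in the paper.

```python
import numpy as np

def ix_value_estimate(actions, rewards, pi0_probs, pi_probs, gamma):
    """Implicit-exploration style estimate: adding `gamma` to the
    logging propensity caps every weight at 1 / gamma, trading a small
    downward bias for high-probability concentration."""
    actions = np.asarray(actions)
    idx = np.arange(len(actions))
    weights = pi_probs[idx, actions] / (pi0_probs[idx, actions] + gamma)
    return float(np.mean(weights * rewards))
```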
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement learning aims to identify and evaluate efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Low Variance Off-policy Evaluation with State-based Importance Sampling [21.727827944373793]
This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight.
Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
arXiv Detail & Related papers (2022-12-07T19:56:11Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)