ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling
- URL: http://arxiv.org/abs/2203.04510v1
- Date: Wed, 9 Mar 2022 03:41:15 GMT
- Authors: Subhojyoti Mukherjee, Josiah P. Hanna, Robert Nowak
- Abstract summary: We develop theory for optimal data collection within the class of tree-structured MDPs.
We empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy.
- Score: 10.925914554822343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the problem of data collection for policy evaluation in
Markov decision processes (MDPs). In policy evaluation, we are given a target
policy and asked to estimate the expected cumulative reward it will obtain in
an environment formalized as an MDP. We develop theory for optimal data
collection within the class of tree-structured MDPs by first deriving an oracle
data collection strategy that uses knowledge of the variance of the reward
distributions. We then introduce the Reduced Variance Sampling (ReVar)
algorithm that approximates the oracle strategy when the reward variances are
unknown a priori and bound its sub-optimality compared to the oracle strategy.
Finally, we empirically validate that ReVar leads to policy evaluation with
mean squared error comparable to the oracle strategy and significantly lower
than simply running the target policy.
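The paper's oracle strategy and the ReVar algorithm are not reproduced here, but the core idea of variance-aware data collection can be illustrated with a Neyman-style allocation in a hypothetical one-step setting: sample high-variance actions more often than the target policy would, then reweight. All names and numbers below are illustrative assumptions, not the paper's actual algorithm or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step MDP: the target policy chooses action k with
# probability pi[k]; rewards are Gaussian with (oracle-known) std sigma[k].
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([1.0, 0.0, 2.0])
sigma = np.array([0.1, 2.0, 0.5])
n = 3000  # total sampling budget


def mse(alloc, trials=2000):
    """MSE of the reweighted estimator when arm k gets alloc[k] samples."""
    true_value = pi @ mu
    errs = []
    for _ in range(trials):
        est = sum(pi[k] * rng.normal(mu[k], sigma[k], alloc[k]).mean()
                  for k in range(len(pi)))
        errs.append((est - true_value) ** 2)
    return float(np.mean(errs))


# Baseline: sample on-policy, i.e. proportionally to pi.
on_policy = np.maximum((pi * n).astype(int), 1)

# Oracle-style allocation: n_k proportional to pi_k * sigma_k, so
# high-variance actions are oversampled relative to the target policy.
w = pi * sigma
oracle = np.maximum((w / w.sum() * n).astype(int), 1)

assert mse(oracle) < mse(on_policy)  # variance-aware sampling has lower MSE
```

Analytically, the estimator's variance is the sum of `pi[k]**2 * sigma[k]**2 / n_k`, which the proportional-to-`pi*sigma` allocation minimizes over integer-relaxed budgets; this is the same reasoning that motivates collecting data with a behavior policy other than the target policy.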
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits [34.4901181676133]
We introduce a distributionally robust approach that enhances the reliability of offline policy evaluation in contextual bandits.
Our method aims to deliver robust policy evaluation results in the presence of discrepancies in both context and policy distribution.
arXiv Detail & Related papers (2024-01-21T00:42:06Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits [13.02672341061555]
We study the problem of optimal data collection for policy evaluation in linear bandits.
We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting.
We then use this formulation to derive the optimal allocation of samples per action during data collection.
arXiv Detail & Related papers (2023-01-29T04:33:13Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z)
- A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
arXiv Detail & Related papers (2021-06-12T20:21:38Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline data-driven sequential decision making problem in the framework of Markov decision processes (MDPs).
We propose to evaluate each policy by a set of the average rewards with respect to distributions centered at the policy induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.