ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling
- URL: http://arxiv.org/abs/2203.04510v1
- Date: Wed, 9 Mar 2022 03:41:15 GMT
- Authors: Subhojyoti Mukherjee, Josiah P. Hanna, Robert Nowak
- Abstract summary: We develop theory for optimal data collection within the class of tree-structured MDPs.
We empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy.
- Score: 10.925914554822343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the problem of data collection for policy evaluation in
Markov decision processes (MDPs). In policy evaluation, we are given a target
policy and asked to estimate the expected cumulative reward it will obtain in
an environment formalized as an MDP. We develop theory for optimal data
collection within the class of tree-structured MDPs by first deriving an oracle
data collection strategy that uses knowledge of the variance of the reward
distributions. We then introduce the Reduced Variance Sampling (ReVar)
algorithm that approximates the oracle strategy when the reward variances are
unknown a priori and bound its sub-optimality compared to the oracle strategy.
Finally, we empirically validate that ReVar leads to policy evaluation with
mean squared error comparable to the oracle strategy and significantly lower
than simply running the target policy.
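The paper's oracle strategy and the ReVar algorithm are not reproduced here, but the core idea of variance-aware data collection can be illustrated with a Neyman-style allocation in a hypothetical one-step setting: sample high-variance actions more often than the target policy would, then reweight. All names and numbers below are illustrative assumptions, not the paper's actual algorithm or experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-step MDP: the target policy chooses action k with
# probability pi[k]; rewards are Gaussian with (oracle-known) std sigma[k].
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([1.0, 0.0, 2.0])
sigma = np.array([0.1, 2.0, 0.5])
n = 3000  # total sampling budget


def mse(alloc, trials=2000):
    """MSE of the reweighted estimator when arm k gets alloc[k] samples."""
    true_value = pi @ mu
    errs = []
    for _ in range(trials):
        est = sum(pi[k] * rng.normal(mu[k], sigma[k], alloc[k]).mean()
                  for k in range(len(pi)))
        errs.append((est - true_value) ** 2)
    return float(np.mean(errs))


# Baseline: sample on-policy, i.e. proportionally to pi.
on_policy = np.maximum((pi * n).astype(int), 1)

# Oracle-style allocation: n_k proportional to pi_k * sigma_k, so
# high-variance actions are oversampled relative to the target policy.
w = pi * sigma
oracle = np.maximum((w / w.sum() * n).astype(int), 1)

assert mse(oracle) < mse(on_policy)  # variance-aware sampling has lower MSE
```

Analytically, the estimator's variance is the sum of `pi[k]**2 * sigma[k]**2 / n_k`, which the proportional-to-`pi*sigma` allocation minimizes over integer-relaxed budgets; this is the same reasoning that motivates collecting data with a behavior policy other than the target policy.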
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits [34.4901181676133]
We introduce a distributionally robust approach that enhances the reliability of offline policy evaluation in contextual bandits.
Our method aims to deliver robust policy evaluation results in the presence of discrepancies in both context and policy distribution.
arXiv Detail & Related papers (2024-01-21T00:42:06Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits [13.02672341061555]
We study the problem of optimal data collection for policy evaluation in linear bandits.
We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting.
We then use this formulation to derive the optimal allocation of samples per action during data collection.
arXiv Detail & Related papers (2023-01-29T04:33:13Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- On-Policy Deep Reinforcement Learning for the Average-Reward Criterion [9.343119070691735]
We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL).
In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
arXiv Detail & Related papers (2021-06-14T12:12:09Z)
- A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
arXiv Detail & Related papers (2021-06-12T20:21:38Z)
- Robust Batch Policy Learning in Markov Decision Processes [0.0]
We study the offline data-driven sequential decision making problem in the framework of Markov decision processes (MDPs).
We propose to evaluate each policy by a set of the average rewards with respect to distributions centered at the policy induced stationary distribution.
arXiv Detail & Related papers (2020-11-09T04:41:21Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.