Projected State-action Balancing Weights for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2109.04640v1
- Date: Fri, 10 Sep 2021 03:00:44 GMT
- Title: Projected State-action Balancing Weights for Offline Reinforcement
Learning
- Authors: Jiayi Wang, Zhengling Qi and Raymond K.W. Wong
- Abstract summary: This paper focuses on the value estimation of a target policy based on pre-collected data generated from a possibly different policy.
We propose a novel estimator with approximately projected state-action balancing weights for the policy value estimation.
Numerical experiments demonstrate the promising performance of our proposed estimator.
- Score: 9.732863739456034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline policy evaluation (OPE) is considered a fundamental and challenging
problem in reinforcement learning (RL). This paper focuses on the value
estimation of a target policy based on pre-collected data generated from a
possibly different policy, under the framework of infinite-horizon Markov
decision processes. Motivated by the recently developed marginal importance
sampling method in RL and the covariate balancing idea in causal inference, we
propose a novel estimator with approximately projected state-action balancing
weights for the policy value estimation. We obtain the convergence rate of
these weights, and show that the proposed value estimator is semi-parametric
efficient under technical conditions. In terms of asymptotics, our results
scale with both the number of trajectories and the number of decision points at
each trajectory. As such, consistency can still be achieved with a limited
number of subjects when the number of decision points diverges. In addition, we
make a first attempt towards characterizing the difficulty of OPE problems,
which may be of independent interest. Numerical experiments demonstrate the
promising performance of our proposed estimator.
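For intuition, the value estimate that such weights ultimately feed into is a weighted average of observed rewards, where each weight approximates the ratio of state-action visitation densities under the target and behavior policies. The sketch below (Python with numpy) only illustrates this final averaging step for given weights; the paper's actual contribution, estimating the weights via approximately projected state-action balancing, is not reproduced here, and the function name mis_value_estimate and the toy data are illustrative assumptions rather than code from the paper.

import numpy as np

def mis_value_estimate(rewards, weights):
    # Marginal-importance-sampling-style value estimate.
    # rewards: array of shape (N, T), observed rewards r_{i,t}.
    # weights: array of shape (N, T), estimated ratios
    #          w(s_{i,t}, a_{i,t}) ~ d^target(s, a) / d^behavior(s, a)
    #          of state-action visitation densities.
    # The average runs over all N trajectories and T decision points, so the
    # estimate can sharpen as either N or T grows (in line with the abstract's
    # remark about asymptotics in both quantities). For a discounted criterion
    # one would typically rescale by 1 / (1 - gamma).
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * rewards))

# Toy usage with synthetic data: 5 trajectories, 50 decision points each.
rng = np.random.default_rng(0)
toy_rewards = rng.normal(loc=1.0, scale=0.5, size=(5, 50))
toy_weights = np.clip(rng.lognormal(mean=0.0, sigma=0.3, size=(5, 50)), 0.1, 10.0)
print(mis_value_estimate(toy_rewards, toy_weights))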
Related papers
- Post Reinforcement Learning Inference [22.117487428829488]
We consider estimation and inference using data collected from reinforcement learning algorithms.
We propose a weighted Z-estimation approach with carefully designed adaptive weights to stabilize the time-varying variance.
Primary applications include dynamic treatment effect estimation and dynamic off-policy evaluation.
arXiv Detail & Related papers (2023-02-17T12:53:15Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator built on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Low Variance Off-policy Evaluation with State-based Importance Sampling [21.727827944373793]
This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight.
Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.
arXiv Detail & Related papers (2022-12-07T19:56:11Z) - Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in
Partially Observed Markov Decision Processes [65.91730154730905]
In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors.
Here we tackle this by considering off-policy evaluation in a partially observed Markov decision process (POMDP).
We extend the framework of proximal causal inference to our POMDP setting, providing a variety of settings where identification is made possible.
arXiv Detail & Related papers (2021-10-28T17:46:14Z) - Off-Policy Evaluation in Partially Observed Markov Decision Processes
under Sequential Ignorability [8.388782503421504]
We consider off-policy evaluation of dynamic treatment rules under sequential ignorability.
We show that off-policy evaluation in POMDPs is strictly harder than off-policy evaluation in (fully observed) Markov decision processes.
arXiv Detail & Related papers (2021-10-24T03:35:23Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective offline estimation of stationary values can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)