Robust Batch Policy Learning in Markov Decision Processes
- URL: http://arxiv.org/abs/2011.04185v4
- Date: Wed, 10 Nov 2021 04:06:29 GMT
- Title: Robust Batch Policy Learning in Markov Decision Processes
- Authors: Zhengling Qi, Peng Liao
- Abstract summary: We study the offline data-driven sequential decision making problem in the framework of Markov decision process (MDP)
We propose to evaluate each policy by a set of the average rewards with respect to distributions centered at the policy induced stationary distribution.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the offline data-driven sequential decision making problem in the
framework of Markov decision process (MDP). In order to enhance the
generalizability and adaptivity of the learned policy, we propose to evaluate
each policy by a set of the average rewards with respect to distributions
centered at the policy induced stationary distribution. Given a pre-collected
dataset of multiple trajectories generated by some behavior policy, our goal is
to learn a robust policy in a pre-specified policy class that can maximize the
smallest value of this set. Leveraging the theory of semi-parametric
statistics, we develop a statistically efficient policy learning method for
estimating the defined robust optimal policy. A rate-optimal regret bound, up to
a logarithmic factor, is established in terms of the total number of decision
points in the dataset.
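As a rough illustration of the max-min criterion described above, the sketch below scores each candidate policy by its worst average reward over a small neighborhood of the policy-induced stationary distribution and returns the policy with the best worst-case score. The total-variation ball, the greedy worst-case computation, and all function names are assumptions made for illustration only; the paper's actual uncertainty set and its semi-parametrically efficient estimation procedure are not reproduced here.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution d solving d = d P for an ergodic transition matrix P."""
    n = P.shape[0]
    # Stack the balance equations with the normalization constraint sum(d) = 1
    # and solve in the least-squares sense.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    d = np.clip(d, 0.0, None)
    return d / d.sum()

def worst_case_average_reward(d_pi, r_pi, eps):
    """Smallest <d, r_pi> over state distributions d with ||d - d_pi||_1 <= eps.

    The minimizer moves as much mass as allowed from the highest-reward states
    to the lowest-reward state, so a simple greedy pass is exact here."""
    order = np.argsort(r_pi)          # states sorted from worst to best reward
    d = d_pi.copy()
    worst_state = order[0]
    budget = eps / 2.0                # an L1 radius of eps allows moving eps/2 mass
    for s in order[::-1]:             # take mass from the best states first
        if s == worst_state or budget <= 0:
            continue
        moved = min(d[s], budget)
        d[s] -= moved
        d[worst_state] += moved
        budget -= moved
    return float(d @ r_pi)

def robust_policy(policies, eps):
    """Pick the policy whose worst-case average reward is largest.

    `policies` maps a policy name to (P_pi, r_pi): the state transition matrix
    and per-state expected reward induced by that policy."""
    scores = {}
    for name, (P_pi, r_pi) in policies.items():
        d_pi = stationary_distribution(P_pi)
        scores[name] = worst_case_average_reward(d_pi, r_pi, eps)
    return max(scores, key=scores.get), scores
```

With eps = 0 this rule reduces to ordinary average-reward maximization over the policy class; a larger eps buys robustness to shifts away from the stationary distribution at the cost of nominal performance.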
Related papers
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results (a minimal sketch of this estimator appears after the list).
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates (a minimal LCB selection rule is sketched after the list).
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full support and deficient support logging policies in contextual-bandit settings.
We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
arXiv Detail & Related papers (2022-12-04T21:07:46Z) - Randomized Policy Optimization for Optimal Stopping [0.0]
We propose a new methodology for optimal stopping based on randomized linear policies.
We show that our approach can substantially outperform state-of-the-art methods.
arXiv Detail & Related papers (2022-03-25T04:33:15Z) - ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling [10.925914554822343]
We develop theory for optimal data collection within the class of tree-structured MDPs.
We empirically validate that ReVar leads to policy evaluation with mean squared error comparable to the oracle strategy.
arXiv Detail & Related papers (2022-03-09T03:41:15Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Batch Policy Learning in Average Reward Markov Decision Processes [3.9023554886892438]
Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward.
We develop an optimization algorithm to compute the optimal policy in a parameterized policy class.
The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy.
arXiv Detail & Related papers (2020-07-23T03:28:14Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches (a generic kernel-smoothed variant is sketched after the list).
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
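The "implicit exploration" (IX) estimator referenced in Importance-Weighted Offline Learning Done Right can be sketched as follows, assuming logged contextual-bandit data with recorded propensities and rewards in [0, 1]. Only the gamma-smoothing of the importance weights reflects the IX idea; all names, and the hypothetical candidate-ranking line at the end, are illustrative.

```python
import numpy as np

def ix_value_estimate(pi_probs, rewards, logging_probs, gamma):
    """Importance-weighted value estimate with 'implicit exploration' (IX) smoothing.

    pi_probs[i]      -- probability the target policy puts on the logged action a_i given x_i
    rewards[i]       -- observed reward for (x_i, a_i), assumed to lie in [0, 1]
    logging_probs[i] -- propensity of the behavior policy for a_i given x_i
    gamma            -- IX parameter; adding gamma > 0 to the denominator caps the
                        importance weights, trading a small downward bias for much
                        better-behaved tails than plain inverse-propensity weighting.
    """
    weights = pi_probs / (logging_probs + gamma)
    return float(np.mean(weights * rewards))

# A candidate policy class can then be ranked by its IX estimates, e.g.
# best = max(candidates, key=lambda pi: ix_value_estimate(pi.probs(X, A), R, P_log, gamma))
```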
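The selection rule in Policy learning "without" overlap is, at its core, an argmax over lower confidence bounds. A minimal sketch, assuming some value estimator and some uncertainty width are available (both hypothetical callables here; the paper's LCBs come from a generalized empirical Bernstein inequality):

```python
def pessimistic_policy(candidates, value_estimate, confidence_width):
    """Select the candidate maximizing a lower confidence bound (LCB) rather than
    the point estimate of its value, so that policies whose value is uncertain
    (e.g. poorly covered by the behavior policy) are penalized."""
    return max(candidates, key=lambda pi: value_estimate(pi) - confidence_width(pi))
```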
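In the spirit of Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies, a generic kernel-smoothed doubly robust value estimate looks like the sketch below. The Gaussian kernel, the bandwidth handling, and the function names are assumptions for illustration; the paper develops several kernelization variants and also estimates policy gradients, which this sketch does not cover.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Smoothing kernel K_h(u) = K(u / h) / h with a Gaussian K."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def kernel_dr_value(x, a, r, pi, q_hat, behavior_density, h):
    """Kernel-smoothed doubly robust value estimate for a deterministic policy.

    x, a, r                -- logged contexts, continuous actions, and rewards (arrays)
    pi(x)                  -- deterministic target action for each context
    q_hat(x, a)            -- fitted outcome (Q) model
    behavior_density(x, a) -- estimated density of the logging policy at (x, a)
    h                      -- bandwidth; the kernel weight stands in for the density
                              ratio, which does not exist for a point-mass target policy.
    """
    target_actions = pi(x)
    direct = q_hat(x, target_actions)                   # direct-method term
    w = gaussian_kernel(a - target_actions, h) / behavior_density(x, a)
    correction = w * (r - q_hat(x, a))                  # importance-weighted residual
    return float(np.mean(direct + correction))
```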