Efficient Evaluation of Natural Stochastic Policies in Offline
Reinforcement Learning
- URL: http://arxiv.org/abs/2006.03886v2
- Date: Tue, 3 Nov 2020 21:02:51 GMT
- Title: Efficient Evaluation of Natural Stochastic Policies in Offline
Reinforcement Learning
- Authors: Nathan Kallus, Masatoshi Uehara
- Abstract summary: We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
- Score: 80.42316902296832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the efficient off-policy evaluation of natural stochastic policies,
which are defined in terms of deviations from the behavior policy. This is a
departure from the literature on off-policy evaluation where most work considers
the evaluation of explicitly specified policies. Crucially, offline
reinforcement learning with natural stochastic policies can help alleviate
issues of weak overlap, lead to policies that build upon current practice, and
improve policies' implementability in practice. Compared with the classic case
of a pre-specified evaluation policy, when evaluating natural stochastic
policies, the efficiency bound, which measures the best-achievable estimation
error, is inflated since the evaluation policy itself is unknown. In this
paper, we derive the efficiency bounds of two major types of natural stochastic
policies: tilting policies and modified treatment policies. We then propose
efficient nonparametric estimators that attain the efficiency bounds under very
lax conditions. These also enjoy a (partial) double robustness property.
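To make the two policy classes concrete, the sketch below shows one common way such policies are written down relative to a behavior policy: a tilting policy as an exponential tilt of the behavior density, and a modified treatment policy as a shift of the action the behavior policy would have taken. The specific forms, the function names (behavior_density, tilting_policy_density, modified_treatment_action), and the parameters tau and delta are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative sketch only: assumed forms of the two natural-policy classes,
# written for a one-dimensional continuous action. Nothing here is the paper's code.

def behavior_density(a, s):
    """Placeholder behavior policy pi_b(a | s): here a Gaussian centered at the state."""
    return np.exp(-0.5 * (a - s) ** 2) / np.sqrt(2.0 * np.pi)

def tilting_policy_density(a, s, tau, grid_halfwidth=10.0, grid_size=2001):
    """Tilting policy (assumed form): pi_tau(a | s) proportional to pi_b(a | s) * exp(tau * a),
    i.e. a controlled deviation from the behavior policy. The normalizer is computed
    numerically for this toy example."""
    grid = np.linspace(s - grid_halfwidth, s + grid_halfwidth, grid_size)
    values = behavior_density(grid, s) * np.exp(tau * grid)
    normalizer = np.sum(values) * (grid[1] - grid[0])  # simple Riemann sum
    return behavior_density(a, s) * np.exp(tau * a) / normalizer

def modified_treatment_action(a_observed, delta):
    """Modified treatment policy (assumed form): take the action the behavior policy
    actually produced and shift it, a -> a + delta."""
    return a_observed + delta
```

The point of both constructions is that the evaluation policy is itself a functional of the unknown behavior policy, which is why, as the abstract notes, the efficiency bound is inflated relative to evaluating a pre-specified policy.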
Related papers
- Efficient Multi-Policy Evaluation for Reinforcement Learning [25.83084281519926]
We design a tailored behavior policy to reduce the variance of estimators across all target policies.
We show our estimator has a substantially lower variance compared with previous best methods.
arXiv Detail & Related papers (2024-08-16T12:33:40Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty (a generic conformal-interval sketch appears after this list).
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms [10.356356383401566]
In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states.
We discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target.
We introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions.
arXiv Detail & Related papers (2022-02-15T15:04:10Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Stable and Efficient Policy Evaluation [31.04376768927044]
This paper introduces novel algorithms that are both off-policy stable and on-policy efficient by using the oblique projection method.
The empirical results on various domains validate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-06-06T21:14:06Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches (a minimal kernel-weight sketch appears after this list).
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- Efficient Policy Learning from Surrogate-Loss Classification Reductions [65.91730154730905]
We consider the estimation problem given by a weighted surrogate-loss classification reduction of policy learning.
We show that, under a correct specification assumption, the weighted classification formulation need not be efficient for policy parameters.
We propose an estimation approach based on generalized method of moments, which is efficient for the policy parameters.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
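For the Conformal Off-Policy Evaluation entry above, the following is a generic split-conformal sketch meant only to illustrate what "an interval with a prescribed level of certainty" looks like; the paper's actual construction must account for the shift between behavior and target policies and is not reproduced here. All names and the alpha parameter are illustrative.

```python
import numpy as np

def split_conformal_interval(calibration_preds, calibration_targets, test_pred, alpha=0.1):
    """Generic split conformal prediction: widen a point prediction into an interval
    that covers the truth with probability at least 1 - alpha (marginally), using
    held-out absolute residuals as conformity scores."""
    residuals = np.abs(np.asarray(calibration_targets) - np.asarray(calibration_preds))
    n = residuals.size
    # Finite-sample-corrected quantile level, capped at 1.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(residuals, level)
    return test_pred - q, test_pred + q
```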
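For the Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies entry above, here is a minimal sketch of the kernel-smoothing idea that motivates such estimators: because the density ratio for a deterministic continuous-action policy does not exist, the hard match 1{a = pi(s)} is replaced with a kernel weight. This is only the plain importance-sampling version under assumed inputs, not the doubly robust estimators proposed in that paper.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernelized_is_value(states, actions, rewards, behavior_density, target_policy, bandwidth):
    """Kernel-smoothed importance-sampling estimate of the value of a deterministic,
    continuous-action target policy. The kernel weight concentrates on actions close
    to pi(s) as the bandwidth shrinks, standing in for the undefined density ratio."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    target_actions = np.array([target_policy(s) for s in states])
    weights = gaussian_kernel((actions - target_actions) / bandwidth) / (
        bandwidth * behavior_density(actions, states)
    )
    return float(np.mean(weights * rewards))
```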
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.