Efficient Multi-Policy Evaluation for Reinforcement Learning
- URL: http://arxiv.org/abs/2408.08706v1
- Date: Fri, 16 Aug 2024 12:33:40 GMT
- Title: Efficient Multi-Policy Evaluation for Reinforcement Learning
- Authors: Shuze Liu, Yuxin Chen, Shangtong Zhang
- Abstract summary: We design a tailored behavior policy to reduce the variance of estimators across all target policies.
We show our estimator has a substantially lower variance compared with previous best methods.
- Score: 25.83084281519926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To unbiasedly evaluate multiple target policies, the dominant approach among RL practitioners is to run and evaluate each target policy separately. However, this evaluation method is far from efficient because samples are not shared across policies, and running target policies to evaluate themselves is actually not optimal. In this paper, we address these two weaknesses by designing a tailored behavior policy to reduce the variance of estimators across all target policies. Theoretically, we prove that executing this behavior policy with manyfold fewer samples outperforms on-policy evaluation on every target policy under characterized conditions. Empirically, we show our estimator has a substantially lower variance compared with previous best methods and achieves state-of-the-art performance in a broad range of environments.
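The sample-sharing idea in the abstract can be illustrated in a toy setting. The sketch below is not the paper's algorithm: it assumes a hypothetical 3-armed bandit with known reward second moments, and picks a single behavior policy that minimizes the summed second moment of the ordinary importance-sampling estimators over all target policies. All names (`summed_second_moment_behavior`, `is_estimates`, the reward means, the two target policies) are invented for illustration.

```python
import math
import random


def summed_second_moment_behavior(targets, second_moments):
    """Behavior policy mu(a) proportional to sqrt(sum_k pi_k(a)^2 * E[r^2 | a]).

    In a bandit this minimizes sum_k E_mu[(pi_k(a) / mu(a))^2 * r^2],
    i.e. the summed second moment of the importance-sampling estimators.
    """
    weights = [
        math.sqrt(sum(pi[a] ** 2 for pi in targets) * second_moments[a])
        for a in range(len(second_moments))
    ]
    total = sum(weights)
    return [w / total for w in weights]


def is_estimates(actions, rewards, behavior, targets):
    """Importance-sampling value estimate for every target from shared samples."""
    n = len(actions)
    return [
        sum(pi[a] / behavior[a] * r for a, r in zip(actions, rewards)) / n
        for pi in targets
    ]


rng = random.Random(0)

# Hypothetical bandit: Gaussian rewards with these means and a common std.
mean_reward = [1.0, 0.2, 0.5]
noise_std = 0.1

# Two hypothetical target policies over the three arms.
targets = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
true_values = [sum(p * m for p, m in zip(pi, mean_reward)) for pi in targets]

# E[r^2 | a] = mean^2 + variance; assumed known here for simplicity.
second_moments = [m ** 2 + noise_std ** 2 for m in mean_reward]
behavior = summed_second_moment_behavior(targets, second_moments)

# One batch of samples from the behavior policy serves both targets.
n = 20000
actions = rng.choices(range(3), weights=behavior, k=n)
rewards = [rng.gauss(mean_reward[a], noise_std) for a in actions]
estimates = is_estimates(actions, rewards, behavior, targets)

for value, estimate in zip(true_values, estimates):
    print(f"true value {value:.3f}, shared-sample estimate {estimate:.3f}")
```

Both estimates stay unbiased because each sample is reweighted by the ratio of target to behavior probability. In practice the reward moments would have to be estimated from data, and the sequential (MDP) case treated in the paper needs more machinery than this one-step sketch.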
Related papers
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC).
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Distributionally Robust Policy Evaluation under General Covariate Shift in Contextual Bandits [31.571978291138866]
We introduce a distributionally robust approach that enhances the reliability of offline policy evaluation in contextual bandits.
Our method aims to deliver robust policy evaluation results in the presence of discrepancies in both context and policy distribution.
arXiv Detail & Related papers (2024-01-21T00:42:06Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- A Practical Guide of Off-Policy Evaluation for Bandit Problems [13.607327477092877]
Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies.
We propose a meta-algorithm based on existing OPE estimators.
We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
arXiv Detail & Related papers (2020-10-23T15:11:19Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.