Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
- URL: http://arxiv.org/abs/2301.13734v5
- Date: Wed, 02 Oct 2024 03:41:22 GMT
- Title: Efficient Policy Evaluation with Offline Data Informed Behavior Policy Design
- Authors: Shuze Liu, Shangtong Zhang
- Abstract summary: We propose novel methods that improve the data efficiency of online Monte Carlo estimators.
We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator.
We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data.
- Score: 18.326126953667842
- Abstract: Most reinforcement learning practitioners evaluate their policies with online Monte Carlo estimators for either hyperparameter tuning or testing different algorithmic design choices, where the policy is repeatedly executed in the environment to get the average outcome. Such massive interactions with the environment are prohibitive in many scenarios. In this paper, we propose novel methods that improve the data efficiency of online Monte Carlo estimators while maintaining their unbiasedness. We first propose a tailored closed-form behavior policy that provably reduces the variance of an online Monte Carlo estimator. We then design efficient algorithms to learn this closed-form behavior policy from previously collected offline data. Theoretical analysis is provided to characterize how the behavior policy learning error affects the amount of reduced variance. Compared with previous works, our method achieves better empirical performance in a broader set of environments, with fewer requirements for offline data.
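The estimator at the heart of this abstract is ordinary importance sampling Monte Carlo: trajectories are collected with a behavior policy and reweighted by the likelihood ratio between target and behavior policies, which keeps the estimate unbiased while the choice of behavior policy controls the variance. Below is a minimal sketch of that baseline estimator; the function names, the undiscounted-return setting, and the data layout are illustrative assumptions, and the paper's closed-form variance-reducing behavior policy itself is not implemented here.

```python
import numpy as np

def is_monte_carlo_estimate(trajectories, target_prob, behavior_prob):
    """Ordinary importance sampling Monte Carlo estimate of the target
    policy's expected return from trajectories run under a behavior policy.

    trajectories: list of episodes, each a list of (state, action, reward).
    target_prob(s, a):   pi(a|s), the target policy's action probability.
    behavior_prob(s, a): mu(a|s), the behavior policy's action probability;
                         it must be positive wherever pi(a|s) > 0,
                         otherwise unbiasedness is lost.
    """
    weighted_returns = []
    for episode in trajectories:
        rho = 1.0  # full-trajectory importance ratio
        g = 0.0    # undiscounted return of this episode
        for s, a, r in episode:
            rho *= target_prob(s, a) / behavior_prob(s, a)
            g += r
        weighted_returns.append(rho * g)  # unbiased for J(pi)
    n = len(weighted_returns)
    # point estimate plus a crude variance estimate of the estimator itself
    return np.mean(weighted_returns), np.var(weighted_returns) / n
```

The variance of this estimator depends entirely on the behavior policy; the on-policy choice mu = pi is unbiased but generally not variance-optimal, which is the gap the paper's tailored closed-form behavior policy targets.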
Related papers
- $\Delta\text{-}{\rm OPE}$: Off-Policy Estimation with Pairs of Policies [13.528097424046823]
We introduce $\Delta\text{-}{\rm OPE}$ methods based on the widely used Inverse Propensity Scoring estimator.
Simulated, offline, and online experiments show that our methods significantly improve performance for both evaluation and learning tasks.
arXiv Detail & Related papers (2024-05-16T12:04:55Z)
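For intuition, here is a minimal pairwise IPS sketch in a contextual-bandit setting: both policies are evaluated on the same logged samples, so the differenced estimate exploits their correlation. The data layout and function names are illustrative assumptions, not the paper's $\Delta\text{-}{\rm OPE}$ estimators.

```python
import numpy as np

def ips_value_difference(logs, pi_new, pi_old):
    """Plain IPS estimate of V(pi_new) - V(pi_old) from shared logged data.

    logs: iterable of (context, action, reward, propensity) tuples, where
          propensity = mu(action | context) > 0 under the logging policy.
    pi_new(x, a), pi_old(x, a): action probabilities of the two policies.
    """
    diffs = [(pi_new(x, a) - pi_old(x, a)) / p * r for x, a, r, p in logs]
    # unbiased for the value difference when the logging policy
    # covers the support of both target policies
    return float(np.mean(diffs))
```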
- Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interacting with the environment.
We present RelaxDICE, which employs an asymmetrically relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
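The LCB recipe itself fits in a few lines: subtract an uncertainty width from each point estimate and act greedily on the result. In the sketch below, the count-based 1/sqrt(n) width is a generic stand-in, not the paper's generalized empirical Bernstein bound.

```python
import numpy as np

def lcb_greedy_policy(q_hat, counts, beta=1.0):
    """Act greedily on lower confidence bounds instead of point estimates.

    q_hat:  (n_states, n_actions) array of estimated action values.
    counts: (n_states, n_actions) visit counts in the offline data.
    beta:   scale of the generic 1/sqrt(n) uncertainty width (assumed).
    """
    width = beta / np.sqrt(np.maximum(counts, 1))
    lcb = q_hat - width              # pessimistic value estimates
    return lcb.argmax(axis=1)        # one greedy action per state
```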
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose MISA, a novel framework that approaches offline RL from the perspective of mutual information between states and actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce three variants of MISA and empirically demonstrate that a tighter mutual information lower bound yields better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience-picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning mediocre behaviors on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
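A tabular sketch of that recipe follows: run value iteration on an estimated model and subtract an uncertainty penalty from every Q-value. The count-based 1/sqrt(n) quantifier and the stationary infinite-horizon setting are simplifying assumptions; the paper's construction and guarantees differ.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, R_hat, counts, gamma=0.99,
                                beta=1.0, iters=500):
    """Value iteration with a count-based pessimism penalty (a sketch).

    P_hat:  (S, A, S) estimated transition probabilities.
    R_hat:  (S, A) estimated mean rewards.
    counts: (S, A) visit counts; 1/sqrt(n) is a generic uncertainty
            quantifier standing in for the paper's penalty function.
    """
    S, A, _ = P_hat.shape
    penalty = beta / np.sqrt(np.maximum(counts, 1))
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R_hat + gamma * (P_hat @ V) - penalty  # penalized Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1)  # greedy policy w.r.t. the pessimistic Q
```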
- Optimal Mixture Weights for Off-Policy Evaluation with Multiple Behavior Policies [3.855085732184416]
Off-policy evaluation, a key component of reinforcement learning, evaluates a target policy with offline data collected from behavior policies.
This paper discusses how to correctly mix estimators produced by different behavior policies.
Experiments on simulated recommender systems show that our methods are effective in reducing the mean squared error of estimation.
arXiv Detail & Related papers (2020-11-29T12:57:54Z)
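For independent unbiased estimators, the textbook combination is inverse-variance weighting, sketched below; the paper's analysis of how to correctly mix estimators produced by different behavior policies goes beyond this rule, for instance because the variances must themselves be estimated.

```python
import numpy as np

def inverse_variance_mixture(estimates, variances):
    """Minimum-variance unbiased combination of independent unbiased
    estimators: weight each estimate by the inverse of its variance.

    estimates: per-behavior-policy value estimates.
    variances: their (assumed known) variances.
    """
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()                       # weights sum to one -> unbiased
    return float(np.dot(w, np.asarray(estimates, dtype=float)))
```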
This list is automatically generated from the titles and abstracts of the papers on this site.