Supervised Off-Policy Ranking
- URL: http://arxiv.org/abs/2107.01360v1
- Date: Sat, 3 Jul 2021 07:01:23 GMT
- Title: Supervised Off-Policy Ranking
- Authors: Yue Jin, Yue Zhang, Tao Qin, Xudong Zhang, Jian Yuan, Houqiang Li,
Tie-Yan Liu
- Abstract summary: Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
- Score: 145.3039527243585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) leverages data generated by other policies to
evaluate a target policy. Previous OPE methods mainly focus on precisely
estimating the true performance of a policy. We observe that in many
applications, (1) the end goal of OPE is to compare two or more candidate
policies and choose a good one, which is a much simpler task than
evaluating their true performance; and (2) there are usually multiple policies
that have already been deployed in real-world systems, whose true performance
is therefore known from serving real users. Inspired by these two observations, in this
work, we define a new problem, supervised off-policy ranking (SOPR), which aims
to rank a set of new/target policies based on supervised learning by leveraging
off-policy data and policies with known performance. We further propose a
method for supervised off-policy ranking that learns a policy scoring model by
correctly ranking training policies with known performance rather than
estimating their precise performance. Our method leverages logged states and
policies to learn a Transformer-based model that maps offline interaction
data, including the logged states and the actions taken by a target policy on
these states, to a score. Experiments on different games, datasets, training policy
sets, and test policy sets show that our method outperforms strong baseline OPE
methods in terms of both rank correlation and the performance gap between the
truly best policy and the best of the top three ranked policies. Furthermore, our method is
more stable than baseline methods.
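Below is a minimal, hypothetical sketch of the supervised ranking idea described above, written with PyTorch and SciPy; the module name `PolicyScorer`, the mean pooling, the pairwise logistic loss, and the `evaluate` helper are illustrative assumptions, not the authors' released implementation. A Transformer encoder scores each candidate policy from the logged states paired with the actions that policy takes on them, the scorer is trained so that training policies with higher known performance receive higher scores, and the evaluation helper mirrors the two reported metrics (rank correlation and the top-3 performance gap).

```python
# Hypothetical sketch of the SOPR idea; names and shapes are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.stats import spearmanr


class PolicyScorer(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, d_model: int = 64):
        super().__init__()
        # Embed each (logged state, policy action) pair into the model dimension.
        self.embed = nn.Linear(state_dim + action_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)  # pooled representation -> scalar score

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states:  (num_policies, num_logged_states, state_dim)
        # actions: (num_policies, num_logged_states, action_dim), i.e. the actions
        # each candidate policy takes on the shared logged states.
        x = self.embed(torch.cat([states, actions], dim=-1))
        pooled = self.encoder(x).mean(dim=1)   # pool over logged states
        return self.head(pooled).squeeze(-1)   # one score per policy


def pairwise_ranking_loss(scores: torch.Tensor, true_perf: torch.Tensor) -> torch.Tensor:
    """Logistic pairwise loss: score ordering should match the known performance ordering."""
    score_diff = scores.unsqueeze(0) - scores.unsqueeze(1)      # [i, j] = score_j - score_i
    perf_diff = true_perf.unsqueeze(0) - true_perf.unsqueeze(1)
    better = (perf_diff > 0).float()                            # pairs where policy j truly beats policy i
    return (better * nn.functional.softplus(-score_diff)).sum() / better.sum().clamp(min=1.0)


def evaluate(scores: np.ndarray, true_perf: np.ndarray):
    """Rank correlation and top-3 regret, the two metrics highlighted in the abstract."""
    rho, _ = spearmanr(scores, true_perf)
    best_of_top3 = true_perf[np.argsort(scores)[-3:]].max()  # best true return among the 3 highest-ranked
    regret = true_perf.max() - best_of_top3                  # gap to the truly best policy
    return rho, regret
```

Since the scores are only used to order the target policies, any monotone transformation of the scoring head would serve equally well; only the induced ranking matters.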
Related papers
- Efficient Multi-Policy Evaluation for Reinforcement Learning [25.83084281519926]
We design a tailored behavior policy to reduce the variance of estimators across all target policies.
We show our estimator has a substantially lower variance compared with previous best methods.
arXiv Detail & Related papers (2024-08-16T12:33:40Z)
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition.
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z)
- Counterfactual Learning with General Data-generating Policies [3.441021278275805]
We develop an OPE method for a class of full support and deficient support logging policies in contextual-bandit settings.
We prove that our method's prediction converges in probability to the true performance of a counterfactual policy as the sample size increases.
arXiv Detail & Related papers (2022-12-04T21:07:46Z)
- Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking [2.8176502405615396]
Evolution Strategy (ES) is a powerful black-box optimization technique based on the idea of natural evolution.
We propose a novel off-policy alternative for ranking, based on a local approximation for the fitness function.
We demonstrate our idea in the context of a state-of-the-art ES method called Augmented Random Search (ARS).
arXiv Detail & Related papers (2022-08-22T20:29:20Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
- Efficient Evaluation of Natural Stochastic Policies in Offline Reinforcement Learning [80.42316902296832]
We study the efficient off-policy evaluation of natural policies, which are defined in terms of deviations from the behavior policy.
This is a departure from the literature on off-policy evaluation, where most work considers the evaluation of explicitly specified policies.
arXiv Detail & Related papers (2020-06-06T15:08:24Z)
- Policy Evaluation Networks [50.53250641051648]
We introduce a scalable, differentiable fingerprinting mechanism that retains essential policy information in a concise embedding.
Our empirical results demonstrate that combining these three elements can produce policies that outperform those that generated the training data.
arXiv Detail & Related papers (2020-02-26T23:00:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.