Pessimistic Off-Policy Optimization for Learning to Rank
- URL: http://arxiv.org/abs/2206.02593v1
- Date: Mon, 6 Jun 2022 12:58:28 GMT
- Title: Pessimistic Off-Policy Optimization for Learning to Rank
- Authors: Matej Cief, Branislav Kveton and Michal Kompan
- Abstract summary: Off-policy learning is a framework for optimizing policies without deploying them.
In recommender systems, this is especially challenging due to the imbalance in logged data.
We study pessimistic off-policy optimization for learning to rank.
- Score: 9.197878514042227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy learning is a framework for optimizing policies without deploying
them, using data collected by another policy. In recommender systems, this is
especially challenging due to the imbalance in logged data: some items are
recommended and thus logged much more frequently than others. This is further
perpetuated when recommending a list of items, as the action space is
combinatorial. To address this challenge, we study pessimistic off-policy
optimization for learning to rank. The key idea is to compute lower confidence
bounds on parameters of click models and then return the list with the highest
pessimistic estimate of its value. This approach is computationally efficient
and we analyze it. We study its Bayesian and frequentist variants, and overcome
the limitation of unknown prior by incorporating empirical Bayes. To show the
empirical effectiveness of our approach, we compare it to off-policy optimizers
that use inverse propensity scores or neglect uncertainty. Our approach
outperforms all baselines and is both robust and general.
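As a concrete illustration of the key idea, below is a minimal Python sketch of pessimistic ranking under a position-based click model: attraction probabilities are estimated from logged clicks, a lower confidence bound replaces each point estimate, and the list with the highest pessimistic value is returned. The Hoeffding-style bound, the function names, and the simplified position handling are assumptions for illustration, not the authors' exact estimator (the paper studies both Bayesian and frequentist variants).

```python
import numpy as np

def pessimistic_ranking(clicks, impressions, exam_probs, k, delta=0.05):
    """Return the top-k list with the highest pessimistic value under a
    position-based click model (PBM).

    clicks[i] / impressions[i] estimates item i's attraction probability
    (position effects are assumed already corrected, for simplicity).
    exam_probs holds the examination probabilities of the k positions,
    in decreasing order.
    """
    n = np.maximum(impressions, 1)  # guard against division by zero
    mean = clicks / n
    # Hoeffding-style lower confidence bound: rarely logged items are
    # penalized heavily, which is the source of the pessimism.
    lcb = np.clip(mean - np.sqrt(np.log(1.0 / delta) / (2.0 * n)), 0.0, 1.0)
    # Under the PBM, the pessimistic value of a list is
    # sum_j exam_probs[j] * lcb[item at position j], which is maximized
    # by sorting items in decreasing order of their LCBs.
    order = np.argsort(-lcb)[:k]
    return order, float(exam_probs @ lcb[order])

# Example: item 3 has a high empirical click rate (1/2) but only two
# impressions, so its LCB collapses to zero and it is not recommended.
top_list, value = pessimistic_ranking(
    clicks=np.array([40, 3, 25, 1, 10]),
    impressions=np.array([100, 5, 80, 2, 50]),
    exam_probs=np.array([1.0, 0.6, 0.3]),
    k=3,
)
```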
Related papers
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on the equivalence of common variance-reduction techniques, such as baseline corrections and doubly robust estimators, in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results (a minimal sketch of this estimator appears after this list).
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score (UIPS) estimator for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [107.84979976896912]
Offline policy learning aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset.
We propose a new algorithm that optimizes lower confidence bounds (LCBs), instead of point estimates, of the policy values.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach [6.7826352751791985]
We propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting.
We integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism.
We develop the computational algorithm based on variational inference, which is highly efficient and scalable.
arXiv Detail & Related papers (2022-10-26T02:14:10Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the pseudoinverse (PI) and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z)
- Robust Policy Search for Robot Navigation with Stochastic Meta-Policies [5.7871177330714145]
In this work, we exploit the main ingredients of Bayesian optimization to provide robustness to different issues for policy search algorithms.
We combine several methods and show how their interaction works better than the sum of the parts.
We compare the proposed algorithm with previous results in several optimization benchmarks and robot tasks, such as pushing objects with a robot arm, or path finding with a rover.
arXiv Detail & Related papers (2020-03-02T16:30:59Z)
- Optimistic Policy Optimization with Bandit Feedback [70.75568142146493]
We propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde{O}(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards and $\tilde{O}(\sqrt{S^2 A H^4} K^{2/3})$ regret for adversarial rewards.
To the best of our knowledge, these are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
arXiv Detail & Related papers (2020-02-19T15:41:18Z)
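For concreteness, the "implicit exploration" (IX) estimator mentioned in the Importance-Weighted Offline Learning Done Right entry above admits a very short sketch. The interface below is an assumption for illustration, not the authors' code: logged propensities of the behavior policy, propensities of the target policy, observed rewards, and the IX smoothing parameter gamma.

```python
import numpy as np

def ix_value_estimate(target_probs, behavior_probs, rewards, gamma=0.1):
    """Implicit-exploration (IX) off-policy value estimate.

    Replaces the plain importance weight pi(a|x) / pi0(a|x) with
    pi(a|x) / (pi0(a|x) + gamma), capping the variance contributed by
    rarely logged actions at the price of a small downward bias.
    """
    weights = target_probs / (behavior_probs + gamma)
    return float(np.mean(weights * rewards))
```

Setting gamma = 0 recovers the standard inverse propensity score estimator; a positive gamma makes the estimate pessimistic, in the same spirit as the lower confidence bounds used in the main paper.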