Off-Policy Evaluation of Slate Bandit Policies via Optimizing
Abstraction
- URL: http://arxiv.org/abs/2402.02171v2
- Date: Sat, 17 Feb 2024 17:35:35 GMT
- Title: Off-Policy Evaluation of Slate Bandit Policies via Optimizing
Abstraction
- Authors: Haruka Kiyohara, Masahiro Nomura, Yuta Saito
- Abstract summary: We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
- Score: 22.215852332444907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study off-policy evaluation (OPE) in the problem of slate contextual
bandits where a policy selects multi-dimensional actions known as slates. This
problem is widespread in recommender systems, search engines, marketing, and
medical applications; however, the typical Inverse Propensity Scoring (IPS)
estimator suffers from substantial variance due to large action spaces, making
effective OPE a significant challenge. The PseudoInverse (PI) estimator has
been introduced to mitigate the variance issue by assuming linearity in the
reward function, but this can result in significant bias as this assumption is
hard-to-verify from observed data and is often substantially violated. To
address the limitations of previous estimators, we develop a novel estimator
for OPE of slate bandits, called Latent IPS (LIPS), which defines importance
weights in a low-dimensional slate abstraction space where we optimize slate
abstractions to minimize the bias and variance of LIPS in a data-driven way. By
doing so, LIPS can substantially reduce the variance of IPS without imposing
restrictive assumptions on the reward function structure like linearity.
Through empirical evaluation, we demonstrate that LIPS substantially
outperforms existing estimators, particularly in scenarios with non-linear
rewards and large slate spaces.
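To make the contrast above concrete, here is a minimal, self-contained Python sketch of the two weighting schemes. It is not the authors' implementation: the context-free factored policies, the toy reward, and the hand-picked abstraction `phi` (whether the first slot shows a low-index item) are illustrative assumptions only, whereas LIPS optimizes the abstraction from logged data to balance bias and variance.
```python
# Minimal sketch (illustrative assumptions, not the paper's code): vanilla slate IPS
# vs. a LIPS-style estimator that reweights in a low-dimensional abstraction space.
import numpy as np

rng = np.random.default_rng(0)
n_slots, n_items, n_logs = 3, 10, 5_000  # slates: 3 slots filled from 10 items each

def softmax(logits):
    return np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

def sample_slates(policy_logits, n):
    """Sample n slates, drawing each slot independently from a factored policy."""
    probs = softmax(policy_logits)
    return np.stack([rng.choice(n_items, size=n, p=probs[k]) for k in range(n_slots)], axis=1)

def slate_prob(policy_logits, slates):
    """Probability of each logged slate under a factored policy."""
    probs = softmax(policy_logits)
    return np.prod(probs[np.arange(n_slots), slates], axis=1)

# Context-free logging and evaluation policies (toy assumption).
logging_logits = rng.normal(size=(n_slots, n_items))
eval_logits = rng.normal(size=(n_slots, n_items))

slates = sample_slates(logging_logits, n_logs)
rewards = rng.binomial(1, p=0.1 + 0.5 * (slates[:, 0] < 5))  # toy non-linear reward

# Vanilla IPS: weights over the full slate space (n_items ** n_slots slates),
# which is what drives the variance discussed in the abstract.
w_ips = slate_prob(eval_logits, slates) / slate_prob(logging_logits, slates)

# LIPS-style weights: defined on an abstraction phi(slate). Here phi is hand-picked
# (does the first slot show an item with index < 5?); the paper learns phi instead.
phi = (slates[:, 0] < 5).astype(int)

def marginal_phi_prob(policy_logits, phi_values, n_mc=100_000):
    """Monte-Carlo estimate of P(phi(slate) = z) under a policy."""
    mc_phi = (sample_slates(policy_logits, n_mc)[:, 0] < 5).astype(int)
    p = np.array([np.mean(mc_phi == 0), np.mean(mc_phi == 1)])
    return p[phi_values]

w_lips = marginal_phi_prob(eval_logits, phi) / marginal_phi_prob(logging_logits, phi)

print(f"IPS : estimate {np.mean(w_ips * rewards):.4f}, weight std {w_ips.std():.2f}")
print(f"LIPS: estimate {np.mean(w_lips * rewards):.4f}, weight std {w_lips.std():.2f}")
```
The weight standard deviations illustrate the variance gap: the IPS weights range over the combinatorial slate space, while the LIPS-style weights only distinguish values of the abstraction.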
Related papers
- Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences [24.361550505778155]
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce.
This paper introduces a causal deepset framework that relaxes several key structural assumptions.
We present novel algorithms that incorporate the PI assumption into OPE and thoroughly examine their theoretical foundations.
arXiv Detail & Related papers (2024-07-25T10:02:11Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Exponential Smoothing for Off-Policy Learning [16.284314586358928]
We derive a two-sided PAC-Bayes generalization bound for inverse propensity scoring (IPS).
The bound is tractable, scalable, interpretable and provides learning certificates.
arXiv Detail & Related papers (2023-05-25T09:18:45Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space (a brief sketch of this weighting idea appears after this list).
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy violating SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
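The entry "Off-Policy Evaluation for Large Action Spaces via Embeddings" above describes marginalized importance weights over action embeddings, an idea closely related to the abstraction-space weights used by LIPS. The sketch below is an illustrative toy under stated assumptions (context-free policies, a hand-made deterministic embedding), not that paper's code; it only shows why weights defined on the embedding space have far lower variance than raw IPS weights.
```python
# Minimal sketch (assumptions, not the referenced paper's code): marginalized
# importance weights over action embeddings vs. vanilla IPS over raw actions.
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_embed, n_logs = 1_000, 8, 10_000

# Toy deterministic embedding: each action maps to one of n_embed categories.
embed_of_action = rng.integers(n_embed, size=n_actions)

# Context-free logging and evaluation policies over raw actions (toy assumption).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

actions = rng.choice(n_actions, size=n_logs, p=pi_0)
rewards = rng.binomial(1, p=0.2 + 0.3 * (embed_of_action[actions] % 2))  # toy reward

# Vanilla IPS weights over the raw action space.
w_ips = pi_e[actions] / pi_0[actions]

# Marginalized weights over the embedding space: p_e(embedding) / p_0(embedding).
p0_embed = np.bincount(embed_of_action, weights=pi_0, minlength=n_embed)
pe_embed = np.bincount(embed_of_action, weights=pi_e, minlength=n_embed)
w_marg = pe_embed[embed_of_action[actions]] / p0_embed[embed_of_action[actions]]

print(f"IPS          : value {np.mean(w_ips * rewards):.4f}, weight std {w_ips.std():.2f}")
print(f"Marginalized : value {np.mean(w_marg * rewards):.4f}, weight std {w_marg.std():.2f}")
```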
This list is automatically generated from the titles and abstracts of the papers in this site.