Off-Policy Evaluation of Slate Bandit Policies via Optimizing
Abstraction
- URL: http://arxiv.org/abs/2402.02171v2
- Date: Sat, 17 Feb 2024 17:35:35 GMT
- Title: Off-Policy Evaluation of Slate Bandit Policies via Optimizing
Abstraction
- Authors: Haruka Kiyohara, Masahiro Nomura, Yuta Saito
- Abstract summary: We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for OPE of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
- Score: 22.215852332444907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study off-policy evaluation (OPE) in the problem of slate contextual
bandits where a policy selects multi-dimensional actions known as slates. This
problem is widespread in recommender systems, search engines, marketing, and
medical applications; however, the typical Inverse Propensity Scoring (IPS)
estimator suffers from substantial variance due to large action spaces, making
effective OPE a significant challenge. The PseudoInverse (PI) estimator has
been introduced to mitigate the variance issue by assuming linearity in the
reward function, but this can result in significant bias as this assumption is
hard-to-verify from observed data and is often substantially violated. To
address the limitations of previous estimators, we develop a novel estimator
for OPE of slate bandits, called Latent IPS (LIPS), which defines importance
weights in a low-dimensional slate abstraction space where we optimize slate
abstractions to minimize the bias and variance of LIPS in a data-driven way. By
doing so, LIPS can substantially reduce the variance of IPS without imposing
restrictive assumptions on the reward function structure like linearity.
Through empirical evaluation, we demonstrate that LIPS substantially
outperforms existing estimators, particularly in scenarios with non-linear
rewards and large slate spaces.
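To make the contrast above concrete, here is a minimal, self-contained Python sketch of the two weighting schemes. It is not the authors' implementation: the context-free factored policies, the toy reward, and the hand-picked abstraction `phi` (whether the first slot shows a low-index item) are illustrative assumptions only, whereas LIPS optimizes the abstraction from logged data to balance bias and variance.
```python
# Minimal sketch (illustrative assumptions, not the paper's code): vanilla slate IPS
# vs. a LIPS-style estimator that reweights in a low-dimensional abstraction space.
import numpy as np

rng = np.random.default_rng(0)
n_slots, n_items, n_logs = 3, 10, 5_000  # slates: 3 slots filled from 10 items each

def softmax(logits):
    return np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

def sample_slates(policy_logits, n):
    """Sample n slates, drawing each slot independently from a factored policy."""
    probs = softmax(policy_logits)
    return np.stack([rng.choice(n_items, size=n, p=probs[k]) for k in range(n_slots)], axis=1)

def slate_prob(policy_logits, slates):
    """Probability of each logged slate under a factored policy."""
    probs = softmax(policy_logits)
    return np.prod(probs[np.arange(n_slots), slates], axis=1)

# Context-free logging and evaluation policies (toy assumption).
logging_logits = rng.normal(size=(n_slots, n_items))
eval_logits = rng.normal(size=(n_slots, n_items))

slates = sample_slates(logging_logits, n_logs)
rewards = rng.binomial(1, p=0.1 + 0.5 * (slates[:, 0] < 5))  # toy non-linear reward

# Vanilla IPS: weights over the full slate space (n_items ** n_slots slates),
# which is what drives the variance discussed in the abstract.
w_ips = slate_prob(eval_logits, slates) / slate_prob(logging_logits, slates)

# LIPS-style weights: defined on an abstraction phi(slate). Here phi is hand-picked
# (does the first slot show an item with index < 5?); the paper learns phi instead.
phi = (slates[:, 0] < 5).astype(int)

def marginal_phi_prob(policy_logits, phi_values, n_mc=100_000):
    """Monte-Carlo estimate of P(phi(slate) = z) under a policy."""
    mc_phi = (sample_slates(policy_logits, n_mc)[:, 0] < 5).astype(int)
    p = np.array([np.mean(mc_phi == 0), np.mean(mc_phi == 1)])
    return p[phi_values]

w_lips = marginal_phi_prob(eval_logits, phi) / marginal_phi_prob(logging_logits, phi)

print(f"IPS : estimate {np.mean(w_ips * rewards):.4f}, weight std {w_ips.std():.2f}")
print(f"LIPS: estimate {np.mean(w_lips * rewards):.4f}, weight std {w_lips.std():.2f}")
```
The weight standard deviations illustrate the variance gap: the IPS weights range over the combinatorial slate space, while the LIPS-style weights only distinguish values of the abstraction.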
Related papers
- Causal Deepsets for Off-policy Evaluation under Spatial or Spatio-temporal Interferences [24.361550505778155]
Off-policy evaluation (OPE) is widely applied in sectors such as pharmaceuticals and e-commerce.
This paper introduces a causal deepset framework that relaxes several key structural assumptions.
We present novel algorithms that incorporate the PI assumption into OPE and thoroughly examine their theoretical foundations.
arXiv Detail & Related papers (2024-07-25T10:02:11Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Exponential Smoothing for Off-Policy Learning [16.284314586358928]
We derive a two-sided PAC-Bayes generalization bound for inverse propensity scoring (IPS).
The bound is tractable, scalable, interpretable and provides learning certificates.
arXiv Detail & Related papers (2023-05-25T09:18:45Z)
- Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z)
- Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z)
- Off-Policy Evaluation for Large Action Spaces via Embeddings [36.42838320396534]
Off-policy evaluation (OPE) in contextual bandits has seen rapid adoption in real-world systems.
Existing OPE estimators degrade severely when the number of actions is large.
We propose a new OPE estimator that leverages marginalized importance weights when action embeddings provide structure in the action space (a brief sketch of this weighting idea appears after this list).
arXiv Detail & Related papers (2022-02-13T14:00:09Z)
- Shortest-Path Constrained Reinforcement Learning for Sparse Reward Tasks [59.419152768018506]
We show that any optimal policy necessarily satisfies the k-SP constraint.
We propose a novel cost function that penalizes the policy violating SP constraint, instead of completely excluding it.
Our experiments on MiniGrid, DeepMind Lab, Atari, and Fetch show that the proposed method significantly improves proximal policy optimization (PPO).
arXiv Detail & Related papers (2021-07-13T21:39:21Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
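The entry "Off-Policy Evaluation for Large Action Spaces via Embeddings" above describes marginalized importance weights over action embeddings, an idea closely related to the abstraction-space weights used by LIPS. The sketch below is an illustrative toy under stated assumptions (context-free policies, a hand-made deterministic embedding), not that paper's code; it only shows why weights defined on the embedding space have far lower variance than raw IPS weights.
```python
# Minimal sketch (assumptions, not the referenced paper's code): marginalized
# importance weights over action embeddings vs. vanilla IPS over raw actions.
import numpy as np

rng = np.random.default_rng(1)
n_actions, n_embed, n_logs = 1_000, 8, 10_000

# Toy deterministic embedding: each action maps to one of n_embed categories.
embed_of_action = rng.integers(n_embed, size=n_actions)

# Context-free logging and evaluation policies over raw actions (toy assumption).
pi_0 = rng.dirichlet(np.ones(n_actions))
pi_e = rng.dirichlet(np.ones(n_actions))

actions = rng.choice(n_actions, size=n_logs, p=pi_0)
rewards = rng.binomial(1, p=0.2 + 0.3 * (embed_of_action[actions] % 2))  # toy reward

# Vanilla IPS weights over the raw action space.
w_ips = pi_e[actions] / pi_0[actions]

# Marginalized weights over the embedding space: p_e(embedding) / p_0(embedding).
p0_embed = np.bincount(embed_of_action, weights=pi_0, minlength=n_embed)
pe_embed = np.bincount(embed_of_action, weights=pi_e, minlength=n_embed)
w_marg = pe_embed[embed_of_action[actions]] / p0_embed[embed_of_action[actions]]

print(f"IPS          : value {np.mean(w_ips * rewards):.4f}, weight std {w_ips.std():.2f}")
print(f"Marginalized : value {np.mean(w_marg * rewards):.4f}, weight std {w_marg.std():.2f}")
```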
This list is automatically generated from the titles and abstracts of the papers in this site.