Off-Policy Evaluation of Slate Policies under Bayes Risk
- URL: http://arxiv.org/abs/2101.02553v1
- Date: Tue, 5 Jan 2021 20:07:56 GMT
- Title: Off-Policy Evaluation of Slate Policies under Bayes Risk
- Authors: Nikos Vlassis, Fernando Amat Gil, Ashok Chandrashekar
- Abstract summary: We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over PI grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
- Score: 70.10677881866047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of off-policy evaluation for slate bandits, for the
typical case in which the logging policy factorizes over the slots of the
slate. We slightly depart from the existing literature by taking Bayes risk as
the criterion by which to evaluate estimators, and we analyze the family of
'additive' estimators that includes the pseudoinverse (PI) estimator of
Swaminathan et al.\ (2017; arXiv:1605.04812). Using a control variate approach,
we identify a new estimator in this family that is guaranteed to have lower
risk than PI in the above class of problems. In particular, we show that the
risk improvement over PI grows linearly with the number of slots, and linearly
with the gap between the arithmetic and the harmonic mean of a set of
slot-level divergences between the logging and the target policy. In the
typical case of a uniform logging policy and a deterministic target policy,
each divergence corresponds to slot size, showing that maximal gains can be
obtained for slate problems with diverse numbers of actions per slot.
Related papers
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of citet2015 yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Sparse Feature Selection Makes Batch Reinforcement Learning More Sample
Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z) - Cautious Reinforcement Learning via Distributional Risk in the Dual
Domain [45.17200683056563]
We study the estimation of risk-sensitive policies in reinforcement learning problems defined by a Markov Decision Process (MDPs) whose state and action spaces are countably finite.
We propose a new definition of risk, which we call caution, as a penalty function added to the dual objective of the linear programming (LP) formulation of reinforcement learning.
arXiv Detail & Related papers (2020-02-27T23:18:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.