Learning from eXtreme Bandit Feedback
- URL: http://arxiv.org/abs/2009.12947v2
- Date: Mon, 22 Feb 2021 22:58:15 GMT
- Title: Learning from eXtreme Bandit Feedback
- Authors: Romain Lopez and Inderjit S. Dhillon and Michael I. Jordan
- Abstract summary: We study the problem of batch learning from bandit feedback in the setting of extremely large action spaces.
In this paper, we introduce a selective importance sampling estimator (sIS) that operates in a significantly more favorable bias-variance regime.
We employ this estimator in a novel algorithmic procedure -- named Policy Optimization for eXtreme Models (POXM) -- for learning from bandit feedback on XMC tasks.
- Score: 105.0383130431503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of batch learning from bandit feedback in the setting of
extremely large action spaces. Learning from extreme bandit feedback is
ubiquitous in recommendation systems, in which billions of decisions are made
over sets consisting of millions of choices in a single day, yielding massive
observational data. In these large-scale real-world applications, supervised
learning frameworks such as eXtreme Multi-label Classification (XMC) are widely
used despite the fact that they incur significant biases due to the mismatch
between bandit feedback and supervised labels. Such biases can be mitigated by
importance sampling techniques, but these techniques suffer from impractical
variance when dealing with a large number of actions. In this paper, we
introduce a selective importance sampling estimator (sIS) that operates in a
significantly more favorable bias-variance regime. The sIS estimator is
obtained by performing importance sampling on the conditional expectation of
the reward with respect to a small subset of actions for each instance (a form
of Rao-Blackwellization). We employ this estimator in a novel algorithmic
procedure -- named Policy Optimization for eXtreme Models (POXM) -- for
learning from bandit feedback on XMC tasks. In POXM, the selected actions for
the sIS estimator are the top-p actions of the logging policy, where p is
adjusted from the data and is significantly smaller than the size of the action
space. We use a supervised-to-bandit conversion on three XMC datasets to
benchmark our POXM method against three competing methods: BanditNet, a
previously applied partial matching pruning strategy, and a supervised learning
baseline. Whereas BanditNet sometimes improves marginally over the logging
policy, our experiments show that POXM systematically and significantly
improves over all baselines.
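To make the bias-variance tradeoff described above concrete, the following is a minimal numerical sketch contrasting vanilla importance sampling with a selective, top-p-restricted estimator in the spirit of sIS/POXM. It assumes a single context-free logging policy `mu` and target policy `pi` and a global top-p action set for brevity, whereas the paper selects the top-p actions of the logging policy per instance; it is an illustration under these toy assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, num_actions, p = 5000, 1000, 10  # p << num_actions, as in POXM

# Toy, context-free setup: mu is the logging policy, pi the target policy
# (both are plain probability vectors here; the paper works per instance).
mu = rng.dirichlet(np.ones(num_actions))
pi = rng.dirichlet(np.ones(num_actions))
actions = rng.choice(num_actions, size=n, p=mu)        # logged actions drawn from mu
rewards = rng.binomial(1, 0.05 + 0.9 * (actions < 5))  # a handful of "good" actions

# Vanilla importance sampling: unbiased, but the weights pi/mu blow up
# when the action space is large and mu puts little mass on some actions.
w_is = pi[actions] / mu[actions]
v_is = np.mean(w_is * rewards)

# Selective importance sampling in the spirit of sIS/POXM (illustration only):
# restrict attention to the top-p actions of the logging policy and
# importance-sample within that set, trading some bias for much lower variance.
top_p = np.argsort(mu)[-p:]   # global top-p for brevity; the paper uses per-instance sets
keep = np.isin(actions, top_p)
w_sis = np.where(keep, w_is, 0.0)
v_sis = np.mean(w_sis * rewards)

print(f"vanilla IS estimate: {v_is:.4f}   selective IS estimate: {v_sis:.4f}")
```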
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z)
- Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z)
- Off-Policy Evaluation of Slate Bandit Policies via Optimizing Abstraction [22.215852332444907]
We study the problem of slate contextual bandits where a policy selects multi-dimensional actions known as slates.
The typical Inverse Propensity Scoring (IPS) estimator suffers from substantial variance due to large action spaces.
We develop a novel estimator for off-policy evaluation (OPE) of slate bandits, called Latent IPS (LIPS), which defines importance weights in a low-dimensional slate abstraction space.
arXiv Detail & Related papers (2024-02-03T14:38:09Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Learning Action Embeddings for Off-Policy Evaluation [6.385697591955264]
Off-policy evaluation (OPE) methods allow us to compute the expected reward of a policy by using the logged data collected by a different policy.
But when the number of actions is large, or certain actions are under-explored by the logging policy, existing estimators based on inverse-propensity scoring (IPS) can have a high or even infinite variance.
Saito and Joachims propose marginalized IPS (MIPS), which uses action embeddings instead and reduces the variance of IPS in large action spaces (a minimal illustrative sketch follows this list).
arXiv Detail & Related papers (2023-05-06T06:44:30Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Bayesian Non-stationary Linear Bandits for Large-Scale Recommender Systems [6.009759445555003]
We build upon the linear contextual multi-armed bandit framework to address non-stationarity in large-scale recommender systems.
We develop a decision-making policy for a linear bandit problem with high-dimensional feature vectors.
Our proposed recommender system employs this policy to learn the users' item preferences online while minimizing runtime.
arXiv Detail & Related papers (2022-02-07T13:51:19Z)
- Control Variates for Slate Off-Policy Evaluation [112.35528337130118]
We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions.
We obtain new estimators with risk improvement guarantees over both the pseudoinverse (PI) and self-normalized PI estimators.
arXiv Detail & Related papers (2021-06-15T06:59:53Z)
- Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take into account option correlation.
In CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions.
We propose novel algorithms with optimal regrets (within logarithmic factors) and provide matching lower bounds to validate their optimality.
arXiv Detail & Related papers (2021-02-24T06:37:05Z)
- Output-Weighted Sampling for Multi-Armed Bandits with Extreme Payoffs [11.1546439770774]
We present a new type of acquisition function for online decision making in bandit problems with extreme payoffs.
We formulate a novel type of upper confidence bound (UCB) acquisition function that guides exploration towards the bandits that are deemed most relevant.
arXiv Detail & Related papers (2021-02-19T18:36:03Z)
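As referenced in the Learning Action Embeddings for Off-Policy Evaluation entry above, here is a minimal sketch contrasting per-action IPS weights with marginalized (MIPS-style) weights computed over a deterministic action-to-embedding map. The policies `mu` and `pi`, the embedding map, and the reward model are all toy assumptions for illustration, not the setup of that paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n, num_actions, num_embeds = 5000, 1000, 20

# Hypothetical deterministic map from actions to a small set of embeddings/categories.
embed_of_action = rng.integers(num_embeds, size=num_actions)

mu = rng.dirichlet(np.ones(num_actions))   # logging policy (toy, context-free)
pi = rng.dirichlet(np.ones(num_actions))   # target policy to evaluate
actions = rng.choice(num_actions, size=n, p=mu)
rewards = rng.binomial(1, 0.1 + 0.8 * (embed_of_action[actions] == 0))

def embed_marginal(policy):
    """Marginal probability of each embedding when actions are drawn from `policy`."""
    return np.bincount(embed_of_action, weights=policy, minlength=num_embeds)

# Per-action IPS weights: high variance when num_actions is large.
w_ips = pi[actions] / mu[actions]

# MIPS-style weights: ratio of embedding marginals, so only num_embeds distinct values.
p_e_pi, p_e_mu = embed_marginal(pi), embed_marginal(mu)
e_logged = embed_of_action[actions]
w_mips = p_e_pi[e_logged] / p_e_mu[e_logged]

print(f"IPS estimate:        {np.mean(w_ips * rewards):.4f}")
print(f"MIPS-style estimate: {np.mean(w_mips * rewards):.4f}")
```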
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.