Contextual Bandit with Missing Rewards
- URL: http://arxiv.org/abs/2007.06368v2
- Date: Sun, 19 Jul 2020 00:16:49 GMT
- Title: Contextual Bandit with Missing Rewards
- Authors: Djallel Bouneffouf, Sohini Upadhyay and Yasaman Khazaeni
- Abstract summary: We consider a novel variant of the contextual bandit problem where the reward associated with each context-based decision may not always be observed.
This new problem is motivated by certain online settings including clinical trial and ad recommendation applications.
We propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering.
- Score: 27.066965426355257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a novel variant of the contextual bandit problem (i.e., the
multi-armed bandit with side-information, or context, available to a
decision-maker) where the reward associated with each context-based decision
may not always be observed ("missing rewards"). This new problem is motivated by
certain online settings including clinical trial and ad recommendation
applications. In order to address the missing rewards setting, we propose to
combine the standard contextual bandit approach with an unsupervised learning
mechanism such as clustering. Unlike standard contextual bandit methods, by
leveraging clustering to estimate missing reward, we are able to learn from
each incoming event, even those with missing rewards. Promising empirical
results are obtained on several real-life datasets.
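A minimal sketch of this idea, assuming a LinUCB-style bandit and a simple nearest-centroid clustering of contexts (the abstract does not commit to these particular components; the number of centroids, their random initialization, and the incremental centroid update below are illustrative assumptions):

```python
import numpy as np

class ClusteredLinUCB:
    """LinUCB-style contextual bandit that imputes missing rewards
    from a clustering of previously seen contexts (sketch only)."""

    def __init__(self, n_arms, dim, n_clusters=10, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]        # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]      # per-arm reward vectors
        self.centroids = np.random.randn(n_clusters, dim)    # context cluster centers
        self.r_sum = np.zeros((n_clusters, n_arms))          # observed reward sums per cluster/arm
        self.r_cnt = np.ones((n_clusters, n_arms))           # counts (init at 1 to avoid div-by-zero)

    def select_arm(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, arm, reward=None):
        c = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))  # nearest cluster
        if reward is None:
            # Missing reward: impute the cluster's average observed reward for this arm.
            reward = self.r_sum[c, arm] / self.r_cnt[c, arm]
        else:
            # Observed reward: refine the cluster statistics and nudge the centroid.
            self.r_sum[c, arm] += reward
            self.r_cnt[c, arm] += 1
            self.centroids[c] += 0.05 * (x - self.centroids[c])
        # Standard LinUCB update, using either the observed or the imputed reward.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical usage: pass reward=None when the reward for an event is not observed.
bandit = ClusteredLinUCB(n_arms=5, dim=8)
x = np.random.randn(8)
arm = bandit.select_arm(x)
bandit.update(x, arm, reward=None)  # the event still contributes to learning
```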
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
- $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
arXiv Detail & Related papers (2023-10-22T03:42:59Z)
- Bandit Social Learning: Exploration under Myopic Behavior [58.75758600464338]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z)
- DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits (a minimal Exp3 sketch is given after this list) and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
arXiv Detail & Related papers (2020-11-15T21:57:47Z)
- Online learning with Corrupted context: Corrupted Contextual Bandits [19.675277307158435]
We consider a novel variant of the contextual bandit problem.
This problem is motivated by certain online settings including clinical trial and ad recommendation applications.
We propose to combine the standard contextual bandit approach with a classical multi-armed bandit mechanism.
arXiv Detail & Related papers (2020-06-26T19:53:26Z)
- Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z)
- Self-Supervised Contextual Bandits in Computer Vision [4.165029665035158]
Contextual bandits are a common problem faced by machine learning practitioners.
We propose a novel approach to tackle this issue by combining a contextual bandit objective with a self-supervision objective.
Our results on eight popular computer vision datasets show substantial gains in cumulative reward.
arXiv Detail & Related papers (2020-03-18T22:06:34Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
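The DORB entry above builds on the classical Exp3 algorithm; a minimal, self-contained Exp3 sketch is given below for reference. The `pull` callback and the assumption of rewards in [0, 1] are illustrative, and DORB's SM-Bandit/HM-Bandit multi-reward formulations are not modeled here:

```python
import numpy as np

def exp3(n_arms, horizon, pull, gamma=0.1):
    """Classical Exp3 (exponential weights for adversarial bandits).

    `pull(t, arm)` is a stand-in for the environment and is assumed
    to return a reward in [0, 1]."""
    weights = np.ones(n_arms)
    for t in range(horizon):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = int(np.random.choice(n_arms, p=probs))
        reward = pull(t, arm)
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[arm] *= np.exp(gamma * (reward / probs[arm]) / n_arms)
    return weights
```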
This list is automatically generated from the titles and abstracts of the papers on this site.