Contextual Bandit with Missing Rewards
- URL: http://arxiv.org/abs/2007.06368v2
- Date: Sun, 19 Jul 2020 00:16:49 GMT
- Title: Contextual Bandit with Missing Rewards
- Authors: Djallel Bouneffouf, Sohini Upadhyay and Yasaman Khazaeni
- Abstract summary: We consider a novel variant of the contextual bandit problem where the reward associated with each context-based decision may not always be observed.
This new problem is motivated by certain online settings including clinical trial and ad recommendation applications.
We propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering.
- Score: 27.066965426355257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider a novel variant of the contextual bandit problem (i.e., the
multi-armed bandit with side-information, or context, available to a
decision-maker) where the reward associated with each context-based decision
may not always be observed ("missing rewards"). This new problem is motivated by
certain online settings including clinical trial and ad recommendation
applications. In order to address the missing rewards setting, we propose to
combine the standard contextual bandit approach with an unsupervised learning
mechanism such as clustering. Unlike standard contextual bandit methods, by
leveraging clustering to estimate missing reward, we are able to learn from
each incoming event, even those with missing rewards. Promising empirical
results are obtained on several real-life datasets.
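A minimal sketch of this idea, assuming a LinUCB-style bandit and a simple nearest-centroid clustering of contexts (the abstract does not commit to these particular components; the number of centroids, their random initialization, and the incremental centroid update below are illustrative assumptions):

```python
import numpy as np

class ClusteredLinUCB:
    """LinUCB-style contextual bandit that imputes missing rewards
    from a clustering of previously seen contexts (sketch only)."""

    def __init__(self, n_arms, dim, n_clusters=10, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]        # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]      # per-arm reward vectors
        self.centroids = np.random.randn(n_clusters, dim)    # context cluster centers
        self.r_sum = np.zeros((n_clusters, n_arms))          # observed reward sums per cluster/arm
        self.r_cnt = np.ones((n_clusters, n_arms))           # counts (init at 1 to avoid div-by-zero)

    def select_arm(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, arm, reward=None):
        c = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))  # nearest cluster
        if reward is None:
            # Missing reward: impute the cluster's average observed reward for this arm.
            reward = self.r_sum[c, arm] / self.r_cnt[c, arm]
        else:
            # Observed reward: refine the cluster statistics and nudge the centroid.
            self.r_sum[c, arm] += reward
            self.r_cnt[c, arm] += 1
            self.centroids[c] += 0.05 * (x - self.centroids[c])
        # Standard LinUCB update, using either the observed or the imputed reward.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Hypothetical usage: pass reward=None when the reward for an event is not observed.
bandit = ClusteredLinUCB(n_arms=5, dim=8)
x = np.random.randn(8)
arm = bandit.select_arm(x)
bandit.update(x, arm, reward=None)  # the event still contributes to learning
```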
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z)
- Information Capacity Regret Bounds for Bandits with Mediator Feedback [55.269551124587224]
We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
arXiv Detail & Related papers (2024-02-15T19:18:47Z)
- $\alpha$-Fair Contextual Bandits [10.74025233418392]
Contextual bandit algorithms are at the core of many applications, including recommender systems, clinical trials, and optimal portfolio selection.
One of the most popular problems studied in the contextual bandit literature is to maximize the sum of the rewards in each round.
In this paper, we consider the $\alpha$-Fair Contextual Bandits problem, where the objective is to maximize the global $\alpha$-fair utility function.
arXiv Detail & Related papers (2023-10-22T03:42:59Z)
- Bandit Social Learning: Exploration under Myopic Behavior [58.75758600464338]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z)
- DORB: Dynamically Optimizing Multiple Rewards with Bandits [101.68525259222164]
Policy-based reinforcement learning has proven to be a promising approach for optimizing non-differentiable evaluation metrics for language generation tasks.
We use the Exp3 algorithm for bandits (a minimal Exp3 sketch is given after this list) and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit).
We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks.
arXiv Detail & Related papers (2020-11-15T21:57:47Z)
- Online learning with Corrupted context: Corrupted Contextual Bandits [19.675277307158435]
We consider a novel variant of the contextual bandit problem.
This problem is motivated by certain online settings including clinical trial and ad recommendation applications.
We propose to combine the standard contextual bandit approach with a classical multi-armed bandit mechanism.
arXiv Detail & Related papers (2020-06-26T19:53:26Z)
- Hierarchical Adaptive Contextual Bandits for Resource Constraint based Recommendation [49.69139684065241]
Contextual multi-armed bandit (MAB) achieves cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
arXiv Detail & Related papers (2020-04-02T17:04:52Z)
- Self-Supervised Contextual Bandits in Computer Vision [4.165029665035158]
Contextual bandits are a common problem faced by machine learning practitioners.
We propose a novel approach to tackle this issue by combining a contextual bandit objective with a self-supervision objective.
Our results on eight popular computer vision datasets show substantial gains in cumulative reward.
arXiv Detail & Related papers (2020-03-18T22:06:34Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
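The DORB entry above builds on the classical Exp3 algorithm; a minimal, self-contained Exp3 sketch is given below for reference. The `pull` callback and the assumption of rewards in [0, 1] are illustrative, and DORB's SM-Bandit/HM-Bandit multi-reward formulations are not modeled here:

```python
import numpy as np

def exp3(n_arms, horizon, pull, gamma=0.1):
    """Classical Exp3 (exponential weights for adversarial bandits).

    `pull(t, arm)` is a stand-in for the environment and is assumed
    to return a reward in [0, 1]."""
    weights = np.ones(n_arms)
    for t in range(horizon):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / n_arms
        arm = int(np.random.choice(n_arms, p=probs))
        reward = pull(t, arm)
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[arm] *= np.exp(gamma * (reward / probs[arm]) / n_arms)
    return weights
```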
This list is automatically generated from the titles and abstracts of the papers on this site.