On Under-exploration in Bandits with Mean Bounds from Confounded Data
- URL: http://arxiv.org/abs/2002.08405v4
- Date: Thu, 10 Jun 2021 14:55:17 GMT
- Title: On Under-exploration in Bandits with Mean Bounds from Confounded Data
- Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam and Sanjay Shakkottai
- Abstract summary: We study a variant of the multi-armed bandit problem where side information in the form of bounds on the mean of each arm is provided.
We develop the novel non-optimistic Global Under-Explore (GLUE) algorithm which uses the provided mean bounds.
We show that mean bounds can be inferred naturally from such logs and can thus be used to improve the learning process.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a variant of the multi-armed bandit problem where side information
in the form of bounds on the mean of each arm is provided. We develop the novel
non-optimistic Global Under-Explore (GLUE) algorithm which uses the provided
mean bounds (across all the arms) to infer pseudo-variances for each arm, which
in turn decide the rate of exploration for the arms. We analyze the regret of
GLUE and prove regret upper bounds that are never worse than that of the
standard UCB algorithm. Furthermore, we show that GLUE improves upon regret
guarantees that exist in the literature for structured bandit problems (both
theoretically and empirically). Finally, we study the practical setting of
learning adaptive interventions using prior data that has been confounded by
unrecorded variables that affect rewards. We show that mean bounds can be
inferred naturally from such logs and can thus be used to improve the learning
process. We validate our findings through semi-synthetic experiments on data
derived from real data sets.
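The core idea of the abstract, that per-arm mean bounds can temper how aggressively each arm is explored, can be illustrated with a small sketch. This is not the paper's exact GLUE pseudo-variance construction; it is a simplified UCB-style learner that clips each arm's optimistic index at its known upper bound on the mean, so an arm whose bound is already low receives less exploration. The function name `clipped_ucb` and its signature are illustrative, not from the paper.

```python
import math
import random

def clipped_ucb(T, upper, reward_fn):
    """Run a UCB-style bandit for T rounds, clipping each arm's
    optimistic index at the provided upper bound on its mean.

    upper     -- list of known upper bounds on each arm's mean reward
    reward_fn -- reward_fn(k) returns a sampled reward for arm k

    A simplified illustration of using mean-bound side information;
    NOT the exact GLUE algorithm from the paper.
    """
    K = len(upper)
    counts = [0] * K      # number of pulls per arm
    sums = [0.0] * K      # cumulative reward per arm
    total_reward = 0.0
    for t in range(T):
        if t < K:
            # Initialization: play each arm once.
            arm = t
        else:
            indices = []
            for k in range(K):
                mean = sums[k] / counts[k]
                bonus = math.sqrt(2.0 * math.log(t + 1) / counts[k])
                # Clip the optimistic index by the known upper bound:
                # arms with a low bound need less exploration.
                indices.append(min(mean + bonus, upper[k]))
            arm = max(range(K), key=lambda k: indices[k])
        r = reward_fn(arm)
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return counts, total_reward
```

For instance, with Bernoulli arms of means 0.9, 0.5, and 0.2 and informative bounds `upper = [1.0, 0.6, 0.3]`, the learner concentrates its pulls on the first arm while spending little exploration on the bounded-away suboptimal arms.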
Related papers
- Neural Dueling Bandits [58.90189511247936]
We use a neural network to estimate the reward function using preference feedback for the previously selected arms.
We then extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution.
arXiv Detail & Related papers (2024-07-24T09:23:22Z) - Thompson Sampling in Partially Observable Contextual Bandits [2.465689259704613]
We study bandit policies for learning to select optimal arms based on the data of observations.
Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation.
These techniques pave the road towards studying other decision-making problems with contextual information as well as partial observations.
arXiv Detail & Related papers (2024-02-15T19:37:39Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms at any round based on the observed empirical gaps between the rewards of the arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - Tighter Information-Theoretic Generalization Bounds from Supersamples [27.14107452619853]
We present a variety of novel information-theoretic generalization bounds for learning algorithms.
The presented bounds include square-root bounds and fast-rate bounds, including variants based on variance and sharpness.
We show theoretically or empirically that these bounds are tighter than all information-theoretic bounds known to date on the same supersample setting.
arXiv Detail & Related papers (2023-02-05T17:06:27Z) - Assaying Out-Of-Distribution Generalization in Transfer Learning [103.57862972967273]
We take a unified view of previous work, highlighting message discrepancies that we address empirically.
We fine-tune over 31k networks from nine different architectures in the many- and few-shot settings.
arXiv Detail & Related papers (2022-07-19T12:52:33Z) - From Optimality to Robustness: Dirichlet Sampling Strategies in Stochastic Bandits [0.0]
We study a generic Dirichlet Sampling (DS) algorithm, based on pairwise comparisons of empirical indices computed with re-sampling of the arms' observations.
We show that different variants of this strategy achieve provably optimal regret guarantees when the distributions are bounded and logarithmic regret for semi-bounded distributions with a mild quantile condition.
arXiv Detail & Related papers (2021-11-18T14:34:21Z) - Pure Exploration in Multi-armed Bandits with Graph Side Information [11.633592964399806]
We study pure exploration in multi-armed bandits with graph side-information.
We propose a novel algorithm GRUB (GRaph based UcB) for this problem.
We show that capitalizing on available graph side information yields significant improvements over pure exploration methods.
arXiv Detail & Related papers (2021-08-02T20:06:04Z) - Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z) - Stochastic Multi-Armed Bandits with Control Variates [6.548580592686076]
We study a new variant of the multi-armed bandits problem, where the learner has access to auxiliary information about the arms.
The auxiliary information is correlated with the arm rewards.
We develop an algorithm named UCB-CV that uses improved estimates.
arXiv Detail & Related papers (2021-05-09T15:40:09Z) - Information Directed Sampling for Linear Partial Monitoring [112.05623123909895]
We introduce information directed sampling (IDS) for partial monitoring with a linear reward and observation structure.
IDS achieves adaptive worst-case regret rates that depend on precise observability conditions of the game.
We extend our results to the contextual and the kernelized setting, which significantly increases the range of possible applications.
arXiv Detail & Related papers (2020-02-25T21:30:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.