Introduction to Multi-Armed Bandits
- URL: http://arxiv.org/abs/1904.07272v8
- Date: Wed, 3 Apr 2024 21:32:42 GMT
- Title: Introduction to Multi-Armed Bandits
- Authors: Aleksandrs Slivkins
- Abstract summary: Multi-armed bandits is a framework for algorithms that make decisions over time under uncertainty.
This book provides a more introductory, textbook-like treatment of the subject.
- Score: 58.87314871998078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-armed bandits is a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
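To make the explore/exploit tradeoff behind the abstract concrete, here is a minimal epsilon-greedy sketch for the basic IID-reward model covered in the book's opening chapters. It is an illustrative sketch only, not code from the book; the arm means, horizon, and epsilon value are made-up parameters.

```python
# Minimal epsilon-greedy sketch for the basic IID-reward bandit model.
# Illustrative only: arm means, horizon, and epsilon are made-up values.
import random

def epsilon_greedy(arm_means, horizon, epsilon=0.1):
    """With probability epsilon explore a uniformly random arm,
    otherwise exploit the arm with the best empirical mean."""
    k = len(arm_means)
    counts = [0] * k        # pulls per arm
    estimates = [0.0] * k   # empirical mean reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        if random.random() < epsilon:
            a = random.randrange(k)                               # explore
        else:
            a = max(range(k), key=lambda i: estimates[i])         # exploit
        reward = 1.0 if random.random() < arm_means[a] else 0.0   # IID Bernoulli draw
        counts[a] += 1
        estimates[a] += (reward - estimates[a]) / counts[a]       # running mean update
        total_reward += reward
    return total_reward, estimates

# Example run: three arms with unknown means 0.2, 0.5, 0.7
print(epsilon_greedy([0.2, 0.5, 0.7], horizon=10_000))
```

With a fixed epsilon this strategy keeps exploring forever and so incurs regret linear in the horizon; the book's early chapters discuss how decaying exploration schedules and confidence-bound indices improve the regret rate.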
Related papers
- Survival Multiarmed Bandits with Bootstrapping Methods [0.0]
The Survival Multiarmed Bandits (S-MAB) problem is an extension which constrains an agent to a budget related to observed rewards.
This paper presents a framework that addresses such a dual goal using an objective function balanced by a ruin aversion component.
arXiv Detail & Related papers (2024-10-21T20:21:10Z) - Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z) - Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback [7.8539454948826375]
This paper studies multi-stage systems with end-to-end bandit feedback.
Each job needs to go through multiple stages, each managed by a different agent, before generating an outcome.
The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments.
arXiv Detail & Related papers (2024-04-06T05:34:12Z) - On the Robustness of Epoch-Greedy in Multi-Agent Contextual Bandit Mechanisms [0.7734726150561088]
We show that the most prominent contextual bandit algorithm, $\epsilon$-greedy, can be extended to handle the challenges introduced by strategic arms.
We also show that $\epsilon$-greedy is inherently robust to adversarial data corruption attacks and achieves performance that degrades linearly with the amount of corruption.
arXiv Detail & Related papers (2023-07-15T01:20:31Z) - Bandit Social Learning: Exploration under Myopic Behavior [58.75758600464338]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z) - Hierarchical Bayesian Bandits [51.67132887113412]
We analyze a natural hierarchical Thompson sampling algorithm (hierTS) that can be applied to any problem in this class.
Our regret bounds hold under many instances of such problems, including when the tasks are solved sequentially or in parallel.
Experiments show that the hierarchical structure helps with knowledge sharing among the tasks.
arXiv Detail & Related papers (2021-11-12T20:33:09Z) - Multi-facet Contextual Bandits: A Neural Network Perspective [34.96188300126833]
We study a novel problem of multi-facet bandits involving a group of bandits, each characterizing the users' needs from one unique aspect.
In each round, for the given user, we need to select one arm from each bandit, such that the combination of all arms maximizes the final reward.
This problem can find immediate applications in E-commerce, healthcare, etc.
arXiv Detail & Related papers (2021-06-06T05:48:44Z) - Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound (a generic UCB1 index computation is sketched after this list).
arXiv Detail & Related papers (2020-03-06T20:34:42Z) - The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity [83.81297078039836]
We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents.
We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility.
arXiv Detail & Related papers (2020-02-03T04:58:51Z) - Bandits with Knapsacks beyond the Worst-Case [87.54497614804409]
We present three results that go beyond the worst-case perspective.
First, we provide upper and lower bounds which amount to a full characterization for logarithmic, instance-dependent regret rates.
Second, we consider "simple regret" in BwK, which tracks the algorithm's performance in a given round, and prove that it is small in all but a few rounds.
arXiv Detail & Related papers (2020-02-01T18:50:44Z)
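Several entries above, such as the contextual blocking bandits paper, build on UCB-style confidence-bound indices. Below is a minimal, generic UCB1 sketch in the plain IID-reward setting; it is an assumption-laden illustration, not a reproduction of any cited paper's algorithm, and all parameters are illustrative.

```python
# Generic UCB1 sketch in the plain IID-reward setting. Illustrative only;
# it does not implement the blocking or contextual variants cited above.
import math
import random

def ucb1(arm_means, horizon):
    """Pull each arm once, then always pull the arm maximizing
    empirical mean + sqrt(2 * ln t / pulls), the UCB1 index."""
    k = len(arm_means)
    counts = [0] * k
    estimates = [0.0] * k

    def pull(i):
        reward = 1.0 if random.random() < arm_means[i] else 0.0   # Bernoulli draw
        counts[i] += 1
        estimates[i] += (reward - estimates[i]) / counts[i]       # running mean update
        return reward

    total_reward = sum(pull(i) for i in range(k))                 # one pull per arm to start
    for t in range(k + 1, horizon + 1):
        index = [estimates[i] + math.sqrt(2 * math.log(t) / counts[i]) for i in range(k)]
        total_reward += pull(max(range(k), key=lambda i: index[i]))
    return total_reward, estimates

# Example run with the same three illustrative arms
print(ucb1([0.2, 0.5, 0.7], horizon=10_000))
```

The confidence radius shrinks as an arm is pulled more often, which is what yields the logarithmic instance-dependent regret rates referenced in several of the abstracts above.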
This list is automatically generated from the titles and abstracts of the papers indexed on this site.