The Survival Bandit Problem
- URL: http://arxiv.org/abs/2206.03019v4
- Date: Sat, 6 Jan 2024 05:50:44 GMT
- Title: The Survival Bandit Problem
- Authors: Charles Riou and Junya Honda and Masashi Sugiyama
- Abstract summary: We introduce and study a new variant of the multi-armed bandit problem (MAB), called the survival bandit problem (S-MAB).
While the objective in both problems is to maximize the so-called cumulative reward, in this new variant the procedure is interrupted if the cumulative reward falls below a preset threshold.
This simple yet unexplored extension of the MAB is motivated by many practical applications.
- Score: 65.68378556428861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce and study a new variant of the multi-armed bandit problem (MAB),
called the survival bandit problem (S-MAB). While in both problems, the
objective is to maximize the so-called cumulative reward, in this new variant,
the procedure is interrupted if the cumulative reward falls below a preset
threshold. This simple yet unexplored extension of the MAB is motivated by many
practical applications. For example, when testing two medicines against each
other on voluntary patients, people's health is at stake, and it is necessary
to be able to interrupt experiments if serious side effects occur or if the
disease symptoms are not alleviated by the treatment. From a theoretical
perspective, the S-MAB is the first variant of the MAB where the procedure may
or may not be interrupted. We start by formalizing the S-MAB and we define its
objective as the minimization of the so-called survival regret, which naturally
generalizes the regret of the MAB. Then, we show that the objective of the
S-MAB is considerably more difficult than that of the MAB, in the sense that,
contrary to the MAB, no policy can achieve a reasonably small (i.e., sublinear) survival
regret. Instead, we minimize the survival regret in the sense of Pareto, i.e.,
we seek a policy whose cumulative reward cannot be improved for some problem
instance without being sacrificed for another one. For that purpose, we
identify two key components in the survival regret: the regret given no ruin
(which corresponds to the regret in the MAB), and the probability that the
procedure is interrupted, called the probability of ruin. We derive a lower
bound on the probability of ruin, as well as policies whose probability of ruin
matches the lower bound. Finally, based on a doubling trick on those policies,
we derive a policy which minimizes the survival regret in the sense of Pareto,
giving an answer to an open problem by Perotto et al. (COLT 2019).
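A minimal sketch of the setting described in this abstract is given below. It simulates a bandit run that is interrupted ("ruined") as soon as the cumulative reward drops below a preset threshold, and uses Monte-Carlo repetitions to estimate the two components identified above: the probability of ruin and the regret given no ruin. The epsilon-greedy baseline policy, the ±1 reward distribution, the budget value, and all function names are illustrative assumptions, not taken from the paper.
```python
# Hypothetical sketch (not the authors' code): survival bandit simulation.
import numpy as np


def run_survival_bandit(means, horizon, budget, rng, eps=0.1):
    """Play epsilon-greedy; interrupt ("ruin") once cumulative reward <= -budget."""
    n_arms = len(means)
    counts = np.zeros(n_arms)
    estimates = np.zeros(n_arms)
    cumulative = 0.0
    for t in range(horizon):
        if t < n_arms:                      # play each arm once to initialize
            arm = t
        elif rng.random() < eps:            # explore uniformly at random
            arm = int(rng.integers(n_arms))
        else:                               # exploit current estimates
            arm = int(np.argmax(estimates))
        # Rewards in {-1, +1} with mean means[arm] (an illustrative choice).
        reward = 1.0 if rng.random() < (1.0 + means[arm]) / 2.0 else -1.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        cumulative += reward
        if cumulative <= -budget:           # ruin: the procedure is interrupted
            return cumulative, True
    return cumulative, False


def estimate_components(means, horizon, budget, n_runs=2000, seed=0):
    """Monte-Carlo estimates of the probability of ruin and the regret given no ruin."""
    rng = np.random.default_rng(seed)
    best_mean = max(means)
    rewards, ruined = [], []
    for _ in range(n_runs):
        cum, ruin = run_survival_bandit(means, horizon, budget, rng)
        rewards.append(cum)
        ruined.append(ruin)
    rewards, ruined = np.array(rewards), np.array(ruined)
    prob_ruin = ruined.mean()
    # Regret conditioned on the runs that were never interrupted.
    regret_no_ruin = horizon * best_mean - rewards[~ruined].mean()
    return prob_ruin, regret_no_ruin


if __name__ == "__main__":
    p_ruin, regret = estimate_components(means=[0.1, -0.05], horizon=1000, budget=20)
    print(f"estimated probability of ruin:  {p_ruin:.3f}")
    print(f"estimated regret given no ruin: {regret:.1f}")
```
Note that in this sketch a policy with small regret given no ruin can still perform poorly overall if its probability of ruin is large; balancing these two components is the Pareto trade-off that the paper resolves via the doubling-trick construction.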
Related papers
- Survival Multiarmed Bandits with Bootstrapping Methods [0.0]
The Survival Multiarmed Bandits (S-MAB) problem is an extension which constrains an agent to a budget related to observed rewards.
This paper presents a framework that addresses such a dual goal using an objective function balanced by a ruin aversion component.
arXiv Detail & Related papers (2024-10-21T20:21:10Z) - Best Arm Identification with Minimal Regret [55.831935724659175]
This work studies a best arm identification problem that elegantly amalgamates regret minimization and BAI.
The agent's goal is to identify the best arm with a prescribed confidence level.
The proposed Double KL-UCB algorithm achieves optimality as the confidence level tends to zero.
arXiv Detail & Related papers (2024-09-27T16:46:02Z) - Causal Bandits: The Pareto Optimal Frontier of Adaptivity, a Reduction to Linear Bandits, and Limitations around Unknown Marginals [28.94461817548213]
We prove upper and matching lower bounds on the possible trade-offs in the performance of learning in conditionally benign and arbitrary environments.
We are the first to obtain instance-dependent bounds for causal bandits, by reducing the problem to the linear bandit setting.
arXiv Detail & Related papers (2024-07-01T04:12:15Z) - Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
arXiv Detail & Related papers (2024-06-06T16:18:20Z) - Refining Minimax Regret for Unsupervised Environment Design [15.281908507614512]
We introduce Bayesian level-perfect MMR (BLP), a refinement of the minimax regret objective.
We show that BLP policies act consistently with a Perfect Bayesian policy over all levels.
We also introduce an algorithm, ReMiDi, that results in a BLP policy at convergence.
arXiv Detail & Related papers (2024-02-19T16:51:29Z) - One Arrow, Two Kills: An Unified Framework for Achieving Optimal Regret
Guarantees in Sleeping Bandits [29.896865106960423]
We address the problem of Internal Regret in Sleeping Bandits in the fully adversarial setup.
We propose an algorithm that yields sublinear regret in that measure, even for a completely adversarial sequence of losses and availabilities.
arXiv Detail & Related papers (2022-10-26T19:40:06Z) - A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs [117.82903457289584]
We derive a novel problem-dependent lower-bound for regret in finite-horizon Markov Decision Processes (MDPs).
We show that our lower-bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all.
We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
arXiv Detail & Related papers (2021-06-24T13:46:09Z) - Empirical or Invariant Risk Minimization? A Sample Complexity
Perspective [49.43806345820883]
It is unclear when invariant risk minimization (IRM) should be preferred over the widely-employed empirical risk minimization (ERM) framework.
We find that depending on the type of data generation mechanism, the two approaches might have very different finite-sample and asymptotic behavior.
We further investigate how different factors -- the number of environments, complexity of the model, and IRM penalty weight -- impact the sample complexity of IRM in relation to its distance from the OOD solutions.
arXiv Detail & Related papers (2020-10-30T17:55:30Z) - A Deep Q-learning/genetic Algorithms Based Novel Methodology For
Optimizing Covid-19 Pandemic Government Actions [63.669642197519934]
We use the SEIR epidemiological model to represent the evolution of the virus COVID-19 over time in the population.
The sequences of actions (confinement, self-isolation, two-meter distance or not taking restrictions) are evaluated according to a reward system.
We prove that our methodology is a valid tool to discover actions governments can take to reduce the negative effects of a pandemic in both senses.
arXiv Detail & Related papers (2020-05-15T17:17:45Z)