Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk
- URL: http://arxiv.org/abs/2204.00706v1
- Date: Fri, 1 Apr 2022 22:08:03 GMT
- Title: Strategies for Safe Multi-Armed Bandits with Logarithmic Regret and Risk
- Authors: Tianrui Chen, Aditya Gangrade, Venkatesh Saligrama
- Abstract summary: We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints.
We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation.
This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense.
- Score: 45.87122314291089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate a natural but surprisingly unstudied approach to the multi-armed bandit problem under safety risk constraints. Each arm is associated with an unknown law on safety risks and rewards, and the learner's goal is to maximise reward whilst not playing unsafe arms, as determined by a given threshold on the mean risk.
We formulate a pseudo-regret for this setting that enforces this safety constraint in a per-round way by softly penalising any violation, regardless of the gain in reward due to the same. This has practical relevance to scenarios such as clinical trials, where one must maintain safety for each round rather than in an aggregated sense.
We describe doubly optimistic strategies for this scenario, which maintain optimistic indices for both safety risk and reward. We show that schemas based on both frequentist and Bayesian indices satisfy tight gap-dependent logarithmic regret bounds, and further that these play unsafe arms only logarithmically many times in total. This theoretical analysis is complemented by simulation studies demonstrating the effectiveness of the proposed schemas, and probing the domains in which their use is appropriate.
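The abstract only outlines the doubly optimistic idea, so the sketch below illustrates one natural reading of it rather than the paper's exact schemas: keep an upper-confidence index on each arm's reward and a lower-confidence index on its risk, play the arm with the best reward index among those whose risk index still clears the threshold, and track a per-round pseudo-regret that softly penalises pulls of arms whose true mean risk exceeds the threshold. The arm means, the threshold `alpha`, the penalty weight `lam`, and the confidence radius are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: each arm has a Bernoulli reward and a Bernoulli safety risk.
# An arm is "safe" iff its mean risk is at most the threshold alpha.
mu    = np.array([0.40, 0.55, 0.70, 0.90])   # mean rewards (unknown to the learner; assumed values)
rho   = np.array([0.05, 0.10, 0.15, 0.60])   # mean risks   (unknown to the learner; assumed values)
alpha = 0.20                                 # safety threshold on mean risk (assumption)
lam   = 2.0                                  # hypothetical weight for the soft per-round violation penalty

n_arms, T = len(mu), 20_000
counts   = np.zeros(n_arms)                  # number of pulls per arm
rew_sum  = np.zeros(n_arms)                  # cumulative observed rewards per arm
risk_sum = np.zeros(n_arms)                  # cumulative observed risk events per arm

best_safe = mu[rho <= alpha].max()           # benchmark: best mean reward among truly safe arms
pseudo_regret = 0.0

for t in range(1, T + 1):
    if t <= n_arms:
        a = t - 1                            # play each arm once to initialise the indices
    else:
        bonus      = np.sqrt(2.0 * np.log(t) / counts)
        ucb_reward = rew_sum / counts + bonus    # optimistic (upper) index on reward
        lcb_risk   = risk_sum / counts - bonus   # optimistic (lower) index on risk
        plausibly_safe = lcb_risk <= alpha       # arms that could still be safe
        if plausibly_safe.any():
            # among plausibly safe arms, pick the one with the largest reward index
            a = int(np.argmax(np.where(plausibly_safe, ucb_reward, -np.inf)))
        else:
            a = int(np.argmin(lcb_risk))         # fallback: the apparently safest arm

    r = rng.binomial(1, mu[a])                   # observed reward
    s = rng.binomial(1, rho[a])                  # observed risk event
    counts[a] += 1
    rew_sum[a] += r
    risk_sum[a] += s

    # Per-round pseudo-regret with a soft safety penalty: reward gap of the chosen arm
    # plus lam times the amount by which its true mean risk exceeds the threshold.
    pseudo_regret += (best_safe - mu[a]) + lam * max(0.0, rho[a] - alpha)

print(f"pseudo-regret after {T} rounds: {pseudo_regret:.1f}")
print("pulls per arm:", counts.astype(int))
```

The fallback used when no arm currently looks safe (pulling the arm with the smallest risk index) is likewise an arbitrary choice for this sketch; the paper's frequentist and Bayesian schemas and their constants may differ.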
Related papers
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
- Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL).
We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
- A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits [0.0]
In areas of high volatility like healthcare or finance, a naive reward approach often does not accurately capture the complexity of the learning problem.
We propose a framework of adaptive risk-aware strategies that operate in non-stationary environments.
arXiv Detail & Related papers (2023-10-24T19:29:13Z)
- Efficient Risk-Averse Reinforcement Learning [79.61412643761034]
In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns.
We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it.
We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks.
arXiv Detail & Related papers (2022-05-10T19:40:52Z)
- SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics [11.132587007566329]
We develop a safe adversarially guided soft actor-critic framework called SAAC.
In SAAC, the adversary aims to break the safety constraint while the RL agent aims to maximize the constrained value function.
We show that SAAC achieves faster convergence, better efficiency, and fewer failures to satisfy the safety constraints.
arXiv Detail & Related papers (2022-04-20T12:32:33Z)
- Best Arm Identification with Safety Constraints [3.7783523378336112]
The best arm identification problem in the multi-armed bandit setting is an excellent model of many real-world decision-making problems.
We study the question of best-arm identification in safety-critical settings, where the goal of the agent is to find the best safe option out of many.
We propose an algorithm in this setting which is guaranteed to learn safely.
arXiv Detail & Related papers (2021-11-23T20:53:12Z)
- Learning to Act Safely with Limited Exposure and Almost Sure Certainty [1.0323063834827415]
This paper aims to put forward the concept that learning to take safe actions in unknown environments, even with probability one guarantees, can be achieved without the need for exploratory trials.
We first focus on the canonical multi-armed bandit problem and seek to study the intrinsic trade-offs of learning safety in the presence of uncertainty.
arXiv Detail & Related papers (2021-05-18T18:05:12Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z)
- Learning under Invariable Bayesian Safety [36.96284975799963]
We adopt a model inspired by recent work on a bandit-like setting for recommendations.
We introduce a safety constraint that should be respected in every round: it requires that the expected value in each round be above a given threshold.
arXiv Detail & Related papers (2020-06-08T12:07:59Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.