Avoiding Catastrophe in Online Learning by Asking for Help
- URL: http://arxiv.org/abs/2402.08062v3
- Date: Fri, 04 Oct 2024 15:23:43 GMT
- Title: Avoiding Catastrophe in Online Learning by Asking for Help
- Authors: Benjamin Plaut, Hanlin Zhu, Stuart Russell
- Abstract summary: We propose an online learning problem where the goal is to minimize the chance of catastrophe.
We provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows.
- Score: 7.881265948305421
- Abstract: Most learning algorithms with formal regret guarantees assume that no mistake is irreparable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are \emph{catastrophic}, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and aim to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online learning model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
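To make the setting concrete, here is a minimal Python sketch of the interaction protocol described in the abstract. The threshold-based query rule and the helper names (`agent_policy`, `mentor_policy`, `payoff`, `confidence`) are illustrative assumptions, not the paper's algorithm; the sketch only shows how per-round payoffs multiply into an overall chance of avoiding catastrophe under a limited number of mentor queries.
```python
import math
import random

def run_protocol(T, agent_policy, mentor_policy, payoff, confidence,
                 query_threshold=0.9):
    """Toy version of the setting: each round the agent either queries the
    mentor or acts on its own; payoff(state, action) in (0, 1] is the chance
    of avoiding catastrophe that round, and the objective is the product of
    payoffs over all rounds. All helper names are hypothetical."""
    log_product = 0.0          # accumulate log-payoffs to avoid underflow
    queries = 0
    for _ in range(T):
        state = random.random()                    # stand-in environment state
        if confidence(state) < query_threshold:    # illustrative query rule
            action = mentor_policy(state)          # ask for help
            queries += 1
        else:
            action = agent_policy(state)
        log_product += math.log(payoff(state, action))
    return math.exp(log_product), queries          # overall survival chance, #queries
```
Regret in this setting compares the returned survival chance against the mentor's own product of payoffs; the paper's positive result is that, for mentor policy classes learnable in the standard online model, both this regret and queries/T can be driven to 0 as T grows.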
Related papers
- Multi-Agent Imitation Learning: Value is Easy, Regret is Hard [52.31989962031179]
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents.
Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations.
While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents.
arXiv Detail & Related papers (2024-06-06T16:18:20Z)
- Online Learning with Unknown Constraints [10.263431543520452]
We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round.
The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round.
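As a rough illustration of this setting (not the paper's algorithm), a learner can maintain a conservatively estimated safe set and restrict play to it; all names below are hypothetical:
```python
from collections import defaultdict

def cautious_play(T, sample_loss, reveals_violation, init_safe):
    """Illustrative loop for online learning under an unknown safety
    constraint: play only actions currently believed safe, track empirical
    losses, and permanently drop any action whose feedback reveals a
    violation. A sketch under invented feedback assumptions only."""
    believed_safe = set(init_safe)                    # conservative starting set
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in range(T):
        if not believed_safe:
            raise RuntimeError("no action is still certified safe")
        # untried actions get -inf, forcing one round of exploration each
        a = min(believed_safe,
                key=lambda x: sums[x] / counts[x] if counts[x] else float("-inf"))
        loss = sample_loss(a, t)
        sums[a] += loss
        counts[a] += 1
        if reveals_violation(a, t):                   # safety feedback this round
            believed_safe.discard(a)
    return {a: sums[a] / counts[a] for a in counts}
```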
arXiv Detail & Related papers (2024-03-06T20:23:59Z)
- Truly No-Regret Learning in Constrained MDPs [61.78619476991494]
We propose a model-based primal-dual algorithm to learn in an unknown CMDP.
We prove that our algorithm achieves sublinear regret without error cancellations.
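For context, a generic primal-dual (Lagrangian) update for a CMDP with constraint E[cost] <= budget looks as follows; this is a textbook sketch, not the paper's model-based algorithm or its analysis:
```python
def dual_ascent_step(lmbda, value_reward, value_cost, budget, step_size):
    """Generic primal-dual CMDP update: the primal player optimizes the
    Lagrangian reward - lambda * cost, while the dual variable lambda
    increases whenever the constraint E[cost] <= budget is violated.
    Textbook sketch only, not the paper's algorithm."""
    lagrangian = value_reward - lmbda * value_cost   # objective for the policy step
    violation = value_cost - budget                  # positive if constraint broken
    lmbda = max(0.0, lmbda + step_size * violation)  # projected dual ascent
    return lmbda, lagrangian
```
"Error cancellations" refers, roughly, to analyses in which over-satisfying the constraint in some episodes is allowed to offset violations in others; the paper's sublinear regret bound rules this out.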
arXiv Detail & Related papers (2024-02-24T09:47:46Z)
- The Risks of Recourse in Binary Classification [10.067421338825545]
We study whether providing algorithmic recourse is beneficial or harmful at the population level.
We find that there are many plausible scenarios in which providing recourse turns out to be harmful.
We conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.
arXiv Detail & Related papers (2023-06-01T09:46:43Z)
- Agnostic Multi-Robust Learning Using ERM [19.313739782029185]
A fundamental problem in robust learning is asymmetry: a learner needs to correctly classify every one of exponentially many perturbations that an adversary might make to a test-time natural example.
In contrast, the attacker only needs to find one successful perturbation.
We introduce a novel multi-group setting and a corresponding multi-robust learning problem.
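The asymmetry is easy to state in code. Below is a hedged sketch of the adversarially robust 0-1 risk (generic robust ERM, not the paper's multi-group construction); `perturbations` is a hypothetical stand-in for the adversary's allowed corruption set:
```python
def robust_empirical_risk(model, data, perturbations):
    """Adversarially robust 0-1 risk: a point counts as an error if ANY
    allowed perturbation flips the prediction (the asymmetry noted above:
    the learner must survive every perturbation, the attacker needs only
    one). perturbations(x) yields the allowed corruptions of x."""
    errors = 0
    for x, y in data:
        if any(model(x_adv) != y for x_adv in perturbations(x)):
            errors += 1
    return errors / len(data)
```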
arXiv Detail & Related papers (2023-03-15T21:30:14Z)
- Learning to Be Cautious [71.9871661858886]
A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations.
We present a sequence of tasks where cautious behavior becomes increasingly non-obvious, as well as an algorithm to demonstrate that it is possible for a system to \emph{learn} to be cautious.
arXiv Detail & Related papers (2021-10-29T16:52:45Z)
- Model-Free Online Learning in Unknown Sequential Decision Making Problems and Games [114.90723492840499]
In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium.
We formalize an online learning setting in which the strategy space is not known to the agent.
We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment.
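For background, the regret-matching update at the core of CFR-style methods looks like this (a standard textbook version; the paper's contribution of handling an unknown strategy space is not shown):
```python
def regret_matching(cum_regret):
    """Standard regret matching: play each action with probability
    proportional to its positive cumulative regret (uniform if none)."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    n = len(cum_regret)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

def update_regrets(cum_regret, utilities, strategy):
    """After observing per-action utilities u(a), each action's regret grows
    by u(a) minus the expected utility of the mixed strategy just played."""
    expected = sum(p * u for p, u in zip(strategy, utilities))
    return [r + u - expected for r, u in zip(cum_regret, utilities)]
```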
arXiv Detail & Related papers (2021-03-08T04:03:24Z)
- Online non-convex optimization with imperfect feedback [33.80530308979131]
We consider the problem of online learning with non-convex losses.
In terms of feedback, we assume that the learner observes - or otherwise constructs - an inexact model for the loss function at each stage.
We propose a mixed-strategy learning policy based on dual averaging.
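A minimal sketch of dual averaging over the probability simplex follows, assuming inexact gradient models are simply summed; the paper's mixed-strategy policy and its treatment of imperfect feedback are more refined:
```python
import math

def dual_averaging_step(grad_sum, grad_model, t, eta=1.0):
    """Generic dual averaging on the simplex: accumulate (possibly inexact)
    gradient models, then play the softmax of the negated running sum with
    a decaying step size. Sketch only, not the paper's policy."""
    grad_sum = [g + gi for g, gi in zip(grad_sum, grad_model)]  # running sum
    step = eta / math.sqrt(t + 1)                               # decaying step
    scores = [-step * g for g in grad_sum]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]                 # stable softmax
    z = sum(weights)
    return grad_sum, [w / z for w in weights]                   # new mixed strategy
```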
arXiv Detail & Related papers (2020-10-16T16:53:13Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
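For reference, the classical episodic score-function (REINFORCE) estimator below produces unbiased policy-gradient estimates; the paper's point is extending unbiased gradient estimation beyond episodic tasks, which this sketch does not capture. Names like `grad_log_pi` are assumptions:
```python
def reinforce_gradient(trajectory, grad_log_pi, gamma=0.99):
    """Classical REINFORCE: unbiased estimate of the discounted policy
    gradient from one sampled episode. trajectory is a list of
    (state, action, reward) triples; grad_log_pi(s, a) returns the
    gradient of log pi(a | s) as a list of floats."""
    # discounted returns-to-go, computed backwards
    returns = []
    G = 0.0
    for (_, _, r) in reversed(trajectory):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    grad = None
    for t, ((s, a, _), G_t) in enumerate(zip(trajectory, returns)):
        coef = (gamma ** t) * G_t      # gamma**t keeps the estimator unbiased
        term = [coef * g for g in grad_log_pi(s, a)]
        grad = term if grad is None else [x + y for x, y in zip(grad, term)]
    return grad
```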
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Excursion Search for Constrained Bayesian Optimization under a Limited Budget of Failures [62.41541049302712]
We propose a novel decision maker grounded in control theory that controls the amount of risk we allow in the search as a function of a given budget of failures.
Our algorithm uses the failures budget more efficiently in a variety of optimization experiments, and generally achieves lower regret, than state-of-the-art methods.
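One toy way to tie risk to a failure budget is sketched below. This is an invented decision rule for illustration only; the paper's controller is grounded in control theory, and all function names here are hypothetical stand-ins for model-based estimates:
```python
def pick_next_query(candidates, expected_improvement, failure_prob,
                    failures_so_far, budget, base_risk=0.5):
    """Toy budget-aware rule for constrained Bayesian optimization: cap the
    tolerated failure probability in proportion to the remaining failure
    budget, then maximize expected improvement among candidates under the
    cap. Assumes budget > 0."""
    remaining = max(budget - failures_so_far, 0)
    risk_cap = base_risk * remaining / budget        # tightens as failures accrue
    feasible = [x for x in candidates if failure_prob(x) <= risk_cap]
    if feasible:
        return max(feasible, key=expected_improvement)
    return min(candidates, key=failure_prob)         # nothing qualifies: play safest
```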
arXiv Detail & Related papers (2020-05-15T09:54:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.