Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
- URL: http://arxiv.org/abs/2510.14884v1
- Date: Thu, 16 Oct 2025 17:01:57 GMT
- Title: Learning When Not to Learn: Risk-Sensitive Abstention in Bandits with Unbounded Rewards
- Authors: Sarah Liaw, Benjamin Plaut
- Abstract summary: In high-stakes AI applications, even a single action can cause irreparable damage. Standard bandit algorithms that explore aggressively may cause irreparable damage when the assumption that all errors are recoverable fails. We propose a caution-based algorithm that learns when not to learn.
- Score: 5.006086647446482
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In high-stakes AI applications, even a single action can cause irreparable damage. However, nearly all of sequential decision-making theory assumes that all errors are recoverable (e.g., by bounding rewards). Standard bandit algorithms that explore aggressively may cause irreparable damage when this assumption fails. Some prior work avoids irreparable errors by asking for help from a mentor, but a mentor may not always be available. In this work, we formalize a model of learning with unbounded rewards without a mentor as a two-action contextual bandit with an abstain option: at each round the agent observes an input and chooses either to abstain (always 0 reward) or to commit (execute a preexisting task policy). Committing yields rewards that are upper-bounded but can be arbitrarily negative, and the commit reward is assumed Lipschitz in the input. We propose a caution-based algorithm that learns when not to learn: it chooses a trusted region and commits only where the available evidence does not already certify harm. Under these conditions and i.i.d. inputs, we establish sublinear regret guarantees, theoretically demonstrating the effectiveness of cautious exploration for deploying learning agents safely in high-stakes environments.
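The abstain-or-commit setting is concrete enough to sketch. Below is a minimal, hedged illustration of the loop with a Lipschitz "certified harm" test; the environment, the constants, and the specific trust rule are assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 10.0     # assumed known Lipschitz constant of the commit reward
R_MAX = 1.0  # commit rewards are capped at R_MAX but unbounded below
T = 2000     # number of rounds

def commit_reward(x):
    # Hypothetical environment: Lipschitz in x (slope 8 <= L), capped at
    # R_MAX, and increasingly negative far from the "safe" input x = 0.2.
    return min(R_MAX, 1.0 - 8.0 * abs(x - 0.2))

history = []  # (input, observed commit reward) pairs

def certified_upper_bound(x):
    # Tightest Lipschitz upper bound on the commit reward at x given the
    # evidence so far; if it is negative, the data already certify harm.
    if not history:
        return R_MAX
    return min(R_MAX, min(r + L * abs(x - xi) for xi, r in history))

total = 0.0
for t in range(T):
    x = rng.random()  # i.i.d. input, as in the paper's model
    if certified_upper_bound(x) >= 0.0:
        r = commit_reward(x)   # commit: execute the preexisting task policy
        history.append((x, r))
        total += r
    # else abstain: reward 0, and no new evidence at this input
print(f"average per-round reward: {total / T:.3f}")
```

The rule only refuses inputs where past observations plus the Lipschitz bound already prove the commit reward is negative; everywhere else it keeps committing, which is one way to read "learning when not to learn."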
Related papers
- Capability-Oriented Training Induced Alignment Risk [101.37328448441208]
We investigate whether language models, when trained with reinforcement learning, will spontaneously learn to exploit flaws to maximize their reward.
Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety.
Our findings suggest that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves.
arXiv Detail & Related papers (2026-02-12T16:13:14Z)
- Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation [12.503662455234954]
We show that modern language models produce confident hallucinations even when wrong answers carry catastrophic consequences.
We propose Reinforced Hesitation (RH): a modification of Reinforcement Learning from Verifiable Rewards (RLVR) that uses ternary rewards instead of binary ones.
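As a hedged sketch of the ternary idea (the exact reward values and the abstention interface here are assumptions, not necessarily the paper's):

```python
def ternary_reward(answer, correct, penalty=2.0):
    """Illustrative ternary RLVR-style reward: +1 for a verified correct
    answer, 0 when the model explicitly hesitates (answer is None), and a
    negative penalty for a confident wrong answer."""
    if answer is None:   # the model chose to hesitate rather than guess
        return 0.0
    return 1.0 if answer == correct else -penalty
```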
arXiv Detail & Related papers (2025-11-14T17:20:45Z)
- Safe Learning Under Irreversible Dynamics via Asking for Help [13.369079495587693]
Most learning algorithms with formal regret guarantees essentially rely on trying all possible behaviors.
We show that this combination enables the agent to learn both safely and effectively.
Our result may be the first formal proof that it is possible for an agent to obtain high reward while becoming self-sufficient.
arXiv Detail & Related papers (2025-02-19T19:01:39Z)
- Can a Bayesian Oracle Prevent Harm from an Agent? [48.12936383352277]
We consider estimating a context-dependent bound on the probability of violating a given safety specification.
Noting that different plausible hypotheses about the world could produce very different outcomes, we derive a bound on the safety violation probability predicted under the true but unknown hypothesis.
We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.
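One hedged reading of such a bound, with hypotheses modeled as callables mapping a context to a predicted violation probability (the plausibility threshold and the interface are assumptions for illustration):

```python
def pessimistic_violation_bound(hypotheses, posterior, context, eps=0.05):
    # Among hypotheses that retain non-negligible posterior mass, take the
    # most pessimistic predicted violation probability; this upper-bounds
    # the true probability whenever the true hypothesis is still plausible.
    plausible = [h for h, p in zip(hypotheses, posterior) if p >= eps]
    # Fall back to the trivial bound if nothing remains plausible.
    return max((h(context) for h in plausible), default=1.0)
```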
arXiv Detail & Related papers (2024-08-09T18:10:42Z)
- The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret [64.04721528586747]
We show that a sufficiently low expected test error of the reward model guarantees low worst-case regret.
We then show that similar problems persist even when using policy regularization techniques.
arXiv Detail & Related papers (2024-06-22T06:43:51Z)
- Avoiding Catastrophe in Online Learning by Asking for Help [7.881265948305421]
We propose an online learning problem where the goal is to minimize the chance of catastrophe.
We first show that in general, any algorithm either queries the mentor at a linear rate or is nearly guaranteed to cause catastrophe.
We provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows.
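A minimal sketch of how a vanishing query rate can arise (the radius schedule and the distance test are assumptions for illustration, not the paper's algorithm):

```python
def should_query_mentor(x, queried_inputs, t, c=1.0):
    # Query only when x is farther from every past query than a slowly
    # shrinking radius; on a bounded input space the number of queries up
    # to time T then grows sublinearly, so the query *rate* tends to 0.
    radius = c / (1 + t) ** 0.25
    return all(abs(x - q) > radius for q in queried_inputs)
```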
arXiv Detail & Related papers (2024-02-12T21:12:11Z)
- The Risks of Recourse in Binary Classification [10.067421338825545]
We study whether providing algorithmic recourse is beneficial or harmful at the population level.
We find that there are many plausible scenarios in which providing recourse turns out to be harmful.
We conclude that the current concept of algorithmic recourse is not reliably beneficial, and therefore requires rethinking.
arXiv Detail & Related papers (2023-06-01T09:46:43Z)
- Bandit Social Learning: Exploration under Myopic Behavior [54.767961587919075]
We study social learning dynamics motivated by reviews on online platforms.
Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regard to exploration.
We derive stark learning failures for any such behavior, and provide matching positive results.
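The myopic behavior is easy to state as a sketch (the cold-start and tie-breaking choices here are assumptions):

```python
import random

def myopic_choice(history):
    """history: arm -> list of rewards observed so far by earlier agents."""
    untried = [a for a, rs in history.items() if not rs]
    if untried:                       # cold start: try an untouched arm
        return random.choice(untried)
    means = {a: sum(rs) / len(rs) for a, rs in history.items()}
    return max(means, key=means.get)  # pure exploitation, never explores
```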
arXiv Detail & Related papers (2023-02-15T01:57:57Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the means.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
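As a hedged sketch of the statistic involved (a histogram estimator is one simple choice, not necessarily the paper's):

```python
import numpy as np

def estimated_mode(samples, bins=20):
    # Histogram-based mode estimate of an arm's observed rewards: the
    # midpoint of the densest bin, optimized in place of the empirical
    # mean. Assumes samples is a non-empty sequence of floats.
    counts, edges = np.histogram(samples, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])
```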
arXiv Detail & Related papers (2020-03-05T21:29:27Z)