Policy Learning with Abstention
- URL: http://arxiv.org/abs/2510.19672v2
- Date: Sun, 09 Nov 2025 02:02:00 GMT
- Title: Policy Learning with Abstention
- Authors: Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis,
- Abstract summary: We study policy learning with abstention, where a policy may defer to a safe default or an expert.<n>When a policy abstains, it receives a small additive reward on top of the value of a random guess.<n>We show that abstention is a versatile tool with direct applications to other core problems in policy learning.
- Score: 24.276267672386982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
Related papers
- Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies.<n>Existing fine-tuning methods typically condition predictions on pre-defined policies.<n>We propose to condition model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
arXiv Detail & Related papers (2025-07-28T03:45:34Z) - A universal policy wrapper with guarantees [0.0]
We introduce a universal policy wrapper for reinforcement learning agents.<n>Our wrapper selectively switches between a high-performing base policy and a fallback policy.<n>It operates without needing additional system knowledge or online constrained optimization.
arXiv Detail & Related papers (2025-05-18T10:37:27Z) - SNPL: Simultaneous Policy Learning and Evaluation for Safe Multi-Objective Policy Improvement [33.60500554561509]
To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data.<n>To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce.<n>In this paper, we provide safe noisy policy learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges.
arXiv Detail & Related papers (2025-03-17T02:53:53Z) - CSPI-MT: Calibrated Safe Policy Improvement with Multiple Testing for Threshold Policies [30.57323631122579]
We focus on threshold policies, a ubiquitous class of policies with applications in economics, healthcare, and digital advertising.
Existing methods rely on potentially underpowered safety checks and limit the opportunities for finding safe improvements.
We show that in adversarial settings, our approach controls the rate of adopting a policy worse than the baseline to the pre-specified error level.
arXiv Detail & Related papers (2024-08-21T21:38:03Z) - Conformal Off-Policy Evaluation in Markov Decision Processes [53.786439742572995]
Reinforcement Learning aims at identifying and evaluating efficient control policies from data.
Most methods for this learning task, referred to as Off-Policy Evaluation (OPE), do not come with accuracy and certainty guarantees.
We present a novel OPE method based on Conformal Prediction that outputs an interval containing the true reward of the target policy with a prescribed level of certainty.
arXiv Detail & Related papers (2023-04-05T16:45:11Z) - Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.4999814847776098]
We examine a particular case of algorithmic pre-trial risk assessments in the US criminal justice system.<n>We analyze data from a unique field experiment on an algorithmic pre-trial risk assessment to investigate whether the scores and recommendations can be improved.
arXiv Detail & Related papers (2021-09-22T00:52:03Z) - Offline Policy Selection under Uncertainty [113.57441913299868]
We consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset.
Access to the full distribution over one's belief of the policy value enables more flexible selection algorithms under a wider range of downstream evaluation metrics.
We show how BayesDICE may be used to rank policies with respect to any arbitrary downstream policy selection metric.
arXiv Detail & Related papers (2020-12-12T23:09:21Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged data.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z) - BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new for RL method, BRPO, which learns both the policy and allowable deviation that jointly maximize a lower bound on policy performance.
arXiv Detail & Related papers (2020-02-08T01:59:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.