Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime
- URL: http://arxiv.org/abs/2512.09850v1
- Date: Wed, 10 Dec 2025 17:34:55 GMT
- Title: Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime
- Authors: Simone Cuonzo, Nina Deliu
- Abstract summary: We introduce Conformal Bandits, a novel framework integrating Conformal Prediction into bandit problems. We bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage. Motivated by this, we showcase our framework's practical advantage in terms of regret in small-gap settings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Conformal Bandits, a novel framework integrating Conformal Prediction (CP) into bandit problems, a classic paradigm for sequential decision-making under uncertainty. Traditional regret-minimisation bandit strategies like Thompson Sampling and Upper Confidence Bound (UCB) typically rely on distributional assumptions or asymptotic guarantees; further, they remain largely focused on regret, neglecting their statistical properties. We address this gap. Through the adoption of CP, we bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage. We demonstrate the potential of Conformal Bandits through simulation studies and an application to portfolio allocation, a typical small-gap regime, where differences in arm rewards are far too small for classical policies to achieve optimal regret bounds in finite samples. Motivated by this, we showcase our framework's practical advantage in terms of regret in small-gap settings, as well as its added value in achieving nominal coverage guarantees where classical UCB policies fail. Focusing on our application of interest, we further illustrate how integrating hidden Markov models to capture the regime-switching behaviour of financial markets enhances the exploration-exploitation trade-off and translates into higher risk-adjusted returns and regret efficiency, while preserving coverage guarantees.
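The core idea described in the abstract, per-arm conformal prediction intervals driving an optimistic arm-selection rule, can be sketched as follows. This is a minimal illustration rather than the authors' algorithm: the conformity score (absolute residual from the sample mean), the Gaussian reward model, and the round-robin warm-up are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_interval(rewards, alpha=0.1):
    """Split-conformal prediction interval for an arm's next reward.

    Conformity scores are absolute residuals from the sample mean;
    the quantile uses the standard (n + 1) finite-sample correction.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    mu = rewards.mean()
    scores = np.abs(rewards - mu)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, n) - 1]  # clamp when k > n
    return mu - q, mu + q

def conformal_bandit(arm_means, horizon=2000, alpha=0.1, warmup=10):
    """Optimistic play: pull the arm with the highest conformal upper bound."""
    n_arms = len(arm_means)
    pulls = [[] for _ in range(n_arms)]
    total = 0.0
    for t in range(horizon):
        if t < warmup * n_arms:
            a = t % n_arms  # round-robin warm-up so every arm has data
        else:
            uppers = [conformal_interval(r, alpha)[1] for r in pulls]
            a = int(np.argmax(uppers))
        r = rng.normal(arm_means[a], 1.0)  # assumed Gaussian reward draw
        pulls[a].append(r)
        total += r
    return total, [len(p) for p in pulls]
```

Note that a conformal interval targets the next reward, not the arm's mean, so its width does not shrink with the number of pulls; this is one reason its coverage behaves differently from classical UCB confidence bounds.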
Related papers
- PAC-Bayesian Generalization Guarantees for Fairness on Stochastic and Deterministic Classifiers [8.438034474012044]
We propose a PAC-Bayesian framework for deriving generalization bounds for fairness. Our framework has two advantages: (i) it applies to a broad class of fairness measures that can be expressed as a risk discrepancy, and (ii) it leads to a self-bounding algorithm.
arXiv Detail & Related papers (2026-02-12T08:49:34Z) - A Simple Reduction Scheme for Constrained Contextual Bandits with Adversarial Contexts via Regression [7.798233121583888]
We study constrained contextual bandits with adversarially chosen contexts, where each action yields a random reward and incurs a random cost. We adopt the standard realizability assumption: conditioned on the observed context, rewards and costs are drawn independently from fixed distributions whose expectations belong to known function classes.
arXiv Detail & Related papers (2026-02-04T20:19:55Z) - Likelihood Reward Redistribution [0.0]
We propose a Likelihood Reward Redistribution (LRR) framework for reward redistribution. When integrated with an off-policy algorithm such as Soft Actor-Critic, LRR yields dense and informative reward signals.
arXiv Detail & Related papers (2025-03-20T20:50:49Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk [26.397343668067382]
We study the optimal trade-off between expectation and tail risk for regret distribution in the multi-armed bandit model. New policies are proposed to characterize the optimal regret tail probability for any regret threshold.
arXiv Detail & Related papers (2023-04-10T01:00:18Z) - Contextual bandits with concave rewards, and an application to fair ranking [108.48223948875685]
We present the first algorithm with provably vanishing regret for Contextual Bandits with Concave Rewards (CBCR).
We derive a novel reduction from the CBCR regret to the regret of a scalar-reward problem.
Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives.
arXiv Detail & Related papers (2022-10-18T16:11:55Z) - Conformal Off-Policy Prediction in Contextual Bandits [54.67508891852636]
Conformal off-policy prediction can output reliable predictive intervals for the outcome under a new target policy.
We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup.
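The conformal off-policy idea summarised above, reweighting conformity scores by target-to-behaviour policy ratios before taking the quantile, can be sketched as follows. The score definition, the simulated ratios, and the function names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """(1 - alpha) quantile of conformity scores under importance weights.

    Weights are target/behaviour density ratios; with all weights equal
    this reduces to the standard split-conformal quantile.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cdf = np.cumsum(w) / w.sum()            # weighted empirical CDF
    idx = int(np.searchsorted(cdf, 1 - alpha))
    return s[min(idx, len(s) - 1)]

# Logged bandit data: residuals of a reward model on a calibration split,
# reweighted toward a hypothetical target policy.
rng = np.random.default_rng(1)
residuals = np.abs(rng.normal(0.0, 1.0, size=500))
ratios = np.exp(rng.normal(0.0, 0.3, size=500))  # assumed density ratios
half_width = weighted_conformal_quantile(residuals, ratios, alpha=0.1)
```

The resulting quantile serves as the half-width of a predictive interval centred at the reward model's prediction for a new context under the target policy.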
arXiv Detail & Related papers (2022-06-09T10:39:33Z) - Nonstationary Stochastic Multiarmed Bandits: UCB Policies and Minimax Regret [5.1398743023989555]
We study the nonstationary Multi-Armed Bandit (MAB) problem in which the distribution of rewards associated with each arm are assumed to be time-varying.
We characterize the performance of the proposed policies in terms of the worst-case regret, which is the supremum of the regret over the set of reward distribution sequences satisfying the variation budget.
arXiv Detail & Related papers (2021-01-22T07:34:09Z) - Selective Classification via One-Sided Prediction [54.05407231648068]
A one-sided prediction (OSP) based relaxation yields a selective classification (SC) scheme that attains near-optimal coverage in the practically relevant high target accuracy regime.
We theoretically derive generalization bounds for SC and OSP, and empirically show that our scheme strongly outperforms state-of-the-art methods in coverage at small error levels.
arXiv Detail & Related papers (2020-10-15T16:14:27Z) - On Lower Bounds for Standard and Robust Gaussian Process Bandit Optimization [55.937424268654645]
We consider algorithm-independent lower bounds for the problem of black-box optimization of functions having a bounded norm.
We provide a novel proof technique for deriving lower bounds on the regret, with benefits including simplicity, versatility, and an improved dependence on the error probability.
arXiv Detail & Related papers (2020-08-20T03:48:14Z) - Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes [5.081241420920605]
Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty.
We consider MDPs with discounted-sum payoff with failure states which represent catastrophic outcomes.
Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP.
arXiv Detail & Related papers (2020-02-27T13:36:36Z) - Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.