Truly Adapting to Adversarial Constraints in Constrained MABs
- URL: http://arxiv.org/abs/2602.14543v1
- Date: Mon, 16 Feb 2026 08:07:11 GMT
- Title: Truly Adapting to Adversarial Constraints in Constrained MABs
- Authors: Francesco Emanuele Stradi, Kalana Kalupahana, Matteo Castiglioni, Alberto Marchesi, Nicola Gatti
- Abstract summary: We study the constrained variant of the \emph{multi-armed bandit} (MAB) problem. We propose an algorithm attaining $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ regret and $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ positive violation. We then show how to extend these guarantees when only bandit feedback is available for the losses.
- Score: 33.41566575424402
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the constrained variant of the \emph{multi-armed bandit} (MAB) problem, in which the learner aims not only at minimizing the total loss incurred during the learning process, but also at controlling the violation of multiple \emph{unknown} constraints, under both \emph{full} and \emph{bandit feedback}. We consider a non-stationary environment that subsumes both stochastic and adversarial models and where, at each round, both losses and constraints are drawn from distributions that may change arbitrarily over time. In such a setting, it is provably not possible to guarantee both sublinear regret and sublinear violation. Accordingly, prior work has mainly focused either on settings with stochastic constraints or on relaxing the benchmark with fully adversarial constraints (\emph{e.g.}, via competitive ratios with respect to the optimum). We provide the first algorithms that achieve optimal rates of regret and \emph{positive} constraint violation when the constraints are stochastic while the losses may vary arbitrarily, and that simultaneously yield guarantees that degrade smoothly with the degree of adversariality of the constraints. Specifically, under \emph{full feedback} we propose an algorithm attaining $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ regret and $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation, where $C$ quantifies the amount of non-stationarity in the constraints. We then show how to extend these guarantees when only bandit feedback is available for the losses. Finally, when \emph{bandit feedback} is available for the constraints, we design an algorithm achieving $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ {positive} violation and $\widetilde{\mathcal{O}}(\sqrt{T}+C\sqrt{T})$ regret.
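To make the setting concrete, below is a minimal, self-contained Python sketch of the generic primal-dual scheme this literature builds on: a Hedge-style primal player minimizes Lagrangian losses under full feedback, while a projected dual ascent step tracks constraint violation. This is an illustration under assumed step sizes and a toy stochastic environment, not the algorithm proposed in the paper.

```python
import numpy as np

# A minimal primal-dual constrained-MAB sketch under full feedback.
# Assumptions (not from the paper): Hedge primal player on Lagrangian
# losses, projected dual ascent, a toy stochastic environment, and
# illustrative step sizes.
rng = np.random.default_rng(0)
K, T = 5, 10_000
eta_p = np.sqrt(np.log(K) / T)  # primal (Hedge) learning rate
eta_d = 1.0 / np.sqrt(T)        # dual ascent step size
lam_max = 10.0                  # cap keeping the dual variable bounded

weights = np.ones(K)            # Hedge weights over arms
lam = 0.0                       # dual variable for the single constraint
total_loss = total_pos_violation = 0.0

for t in range(T):
    p = weights / weights.sum()
    arm = rng.choice(K, p=p)

    # Toy environment with one constraint g_t(a) <= 0; in the paper's
    # setting these distributions may also change arbitrarily over time.
    loss = rng.uniform(0.0, 1.0, size=K)
    g = rng.uniform(-0.5, 0.5, size=K)

    total_loss += loss[arm]
    total_pos_violation += max(g[arm], 0.0)  # positive violation only

    # Primal update: Hedge on the Lagrangian loss + lam * g (full feedback).
    weights *= np.exp(-eta_p * (loss + lam * g))
    weights /= weights.max()  # rescale for numerical stability

    # Dual update: projected gradient ascent on the played constraint value.
    lam = min(max(lam + eta_d * g[arm], 0.0), lam_max)

print(f"avg loss = {total_loss / T:.3f}, "
      f"avg positive violation = {total_pos_violation / T:.3f}")
```

The $\widetilde{\mathcal{O}}(\sqrt{T}+C)$ guarantees above require considerably more machinery (in particular, adapting to the non-stationarity budget $C$); the sketch only conveys the primal-dual structure.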
Related papers
- A Reduction from Delayed to Immediate Feedback for Online Convex Optimization with Improved Guarantees [58.59385794080679]
We introduce a continuous-time model under which regret decomposes into a delay-independent learning term and a delay-induced drift term. For bandit convex optimization, we significantly improve existing regret bounds, with delay-dependent terms matching state-of-the-art first-order rates.
arXiv Detail & Related papers (2026-02-02T18:17:34Z) - Beyond Slater's Condition in Online CMDPs with Stochastic and Adversarial Constraints [33.41566575424402]
We study \emph{online} episodic Constrained Markov Decision Processes (CMDPs) under both stochastic and adversarial constraints. We provide a novel algorithm whose guarantees greatly improve those of the state-of-the-art best-of-both-worlds algorithm introduced by Stradi et al. In the adversarial regime, our algorithm ensures sublinear constraint violation without Slater's condition, and sublinear $\alpha$-regret with respect to the \emph{unconstrained} optimum.
arXiv Detail & Related papers (2025-09-24T13:38:32Z) - An Optimistic Algorithm for Online Convex Optimization with Adversarial Constraints [55.2480439325792]
We study Online Convex Optimization (OCO) with adversarial constraints. We focus on a setting where the algorithm has access to predictions of the loss and constraint functions. Our results show that we can improve the current best bounds of $O(\sqrt{T})$ regret and $\tilde{O}(\sqrt{T})$ cumulative constraint violations.
arXiv Detail & Related papers (2024-12-11T03:06:42Z) - No-Regret is not enough! Bandits with General Constraints through Adaptive Regret Minimization [26.415300249303748]
We show that it is possible to circumvent the issue of sublinear violations of constraints by requiring the primal and dual algorithms to be weakly adaptive.
In the former case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of $\rho/(1+\rho)$.
These results allow us to obtain new results for the problem of contextual bandits with linear constraints.
arXiv Detail & Related papers (2024-05-10T16:22:33Z) - LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits [38.41164102066483]
We develop a \emph{Best-of-Both-Worlds} (BoBW) algorithm with regret upper bounds in both stochastic and adversarial regimes. We show that the proposed algorithm achieves $O\left(\log(T)^{\frac{1+\beta}{2+\beta}}T^{\frac{1}{2+\beta}}\right)$ regret under the margin condition.
arXiv Detail & Related papers (2024-03-05T18:59:47Z) - Multi-point Feedback of Bandit Convex Optimization with Hard Constraints [1.8130068086063336]
We study bandit convex optimization with constraints, where the learner aims to generate a sequence of decisions under partial information about the loss functions.
We adopt the cumulative \textit{hard} constraint violation as the metric of constraint violation.
Our algorithm attains $O(d^2 T^{\max\{c,1-c\}})$ regret bounds and $O(d^2 T^{1-\frac{c}{2}})$ cumulative hard constraint violation bounds for convex loss functions and time-varying constraints.
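(Reading these bounds: the trade-off parameter $c$ trades regret against violation; for instance, $c=1/2$ yields $O(d^2\sqrt{T})$ regret together with $O(d^2 T^{3/4})$ cumulative hard constraint violation, while larger $c$ shrinks the violation exponent $1-\frac{c}{2}$ at the cost of the regret exponent $\max\{c,1-c\}$.)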
arXiv Detail & Related papers (2023-10-17T02:43:22Z) - A Unifying Framework for Online Optimization with Long-Term Constraints [62.35194099438855]
We study online learning problems in which a decision maker has to take a sequence of decisions subject to $m$ long-term constraints.
The goal is to maximize their total reward, while at the same time achieving small cumulative violation across the $T$ rounds.
We present the first best-of-both-worlds type algorithm for this general class of problems, with no-regret guarantees both in the case in which rewards and constraints are selected according to an unknown model, and in the case in which they are selected at each round by an adversary.
arXiv Detail & Related papers (2022-09-15T16:59:19Z) - Complete Policy Regret Bounds for Tallying Bandits [51.039677652803675]
Policy regret is a well established notion of measuring the performance of an online learning algorithm against an adaptive adversary.
We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}.
We provide an algorithm that w.h.p. attains a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors.
arXiv Detail & Related papers (2022-04-24T03:10:27Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
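For reference, the Inverse Gap Weighting rule mentioned here has a compact generic form; the sketch below implements the standard action distribution, which randomizes around the empirical best action and explores every other action with probability inversely proportional to its estimated gap. The reward estimates and the exploration parameter gamma are placeholder inputs, and this generic version is not the paper's submodular-specific construction.

```python
import numpy as np

def inverse_gap_weighting(fhat: np.ndarray, gamma: float) -> np.ndarray:
    """Standard Inverse Gap Weighting: given reward estimates fhat, play the
    empirical best action with the leftover mass and every other action with
    probability inversely proportional to its estimated gap."""
    K = len(fhat)
    best = int(np.argmax(fhat))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            p[a] = 1.0 / (K + gamma * (fhat[best] - fhat[a]))
    p[best] = 1.0 - p.sum()  # remaining mass goes to the greedy action
    return p

# Example: actions with larger estimated gaps receive less exploration.
print(inverse_gap_weighting(np.array([0.9, 0.5, 0.1]), gamma=10.0))
```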
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - Regret and Cumulative Constraint Violation Analysis for Online Convex Optimization with Long Term Constraints [24.97580261894342]
This paper considers online convex optimization with long term constraints, where constraints can be violated in intermediate rounds, but need to be satisfied in the long run.
A novel algorithm is first proposed and it achieves an $\mathcal{O}(T^{\max\{c,1-c\}})$ bound for static regret and an $\mathcal{O}(T^{(1-c)/2})$ bound for cumulative constraint violation.
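(For instance, the balanced choice $c=1/2$ gives $\mathcal{O}(\sqrt{T})$ static regret and $\mathcal{O}(T^{1/4})$ cumulative constraint violation.)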
arXiv Detail & Related papers (2021-06-09T15:18:06Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z)