Related papers: LC-Tsallis-INF: Generalized Best-of-Both-Worlds Linear Contextual Bandits

Related papers

Regularized Online RLHF with Generalized Bilinear Preferences [68.44113000390544]
We consider the problem of contextual online RLHF with general preferences.<n>We adopt the Generalized Bilinear Preference Model to capture preferences via low-rank, skew-symmetric matrices.<n>We prove that the dual gap of the greedy policy is bounded by the square of the estimation error.
arXiv Detail & Related papers (2026-02-26T15:27:53Z)
Truly Adapting to Adversarial Constraints in Constrained MABs [33.41566575424402]
We study the constrained variant of the emphmulti-armed bandit (MAB) problem.<n>We propose an algorithm attaining $widetildemathcalO(sqrtT+C)$ regret and $widetildemathcalO(sqrtT+C)$ positive violation.<n>We then show how to extend these guarantees when only bandit feedback is available for the losses.
arXiv Detail & Related papers (2026-02-16T08:07:11Z)
Improved Best-of-Both-Worlds Regret for Bandits with Delayed Feedback [40.86836752251116]
We study the multi-armed bandit problem with adversarially chosen delays in the Best-Both-Worlds (BoBW) framework.<n>Our main contribution is a new algorithm that matches the known lower bounds in each setting individually.<n>We provide a regret bound which scale as $sum_i>0left(log T/Delta_iright) + frac1Ksum Delta_i sigma_max$, where $Delta_i$ is the sub-optimality gap of arm $i$ and $
arXiv Detail & Related papers (2025-05-30T04:05:52Z)
Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work on the whole range of complexity with $kappakappakappakappa-1right, kappakappakappa-1right, kappakappakappa-1right, kappakappakappa-1right, kappakappakappa-1right, kappakappakappa-1right, kappakappakappa-1right, kappa
arXiv Detail & Related papers (2025-02-11T19:54:11Z)
No-Regret Linear Bandits under Gap-Adjusted Misspecification [38.592043705502725]
Existing linear bandits work usually relies on a uniform misspecification parameter $epsilon$ that measures the sup-norm error of the best linear approximation. We propose a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. We show that the classical LinUCB algorithm -- designed for the realizable case -- is automatically robust against such $rho$-gap-adjusted misspecification.
arXiv Detail & Related papers (2025-01-09T16:44:53Z)
Efficient Frameworks for Generalized Low-Rank Matrix Bandit Problems [61.85150061213987]
We study the generalized low-rank matrix bandit problem, proposed in citelu2021low under the Generalized Linear Model (GLM) framework. To overcome the computational infeasibility and theoretical restrain of existing algorithms, we first propose the G-ESTT framework. We show that G-ESTT can achieve the $tildeO(sqrt(d_1+d_2)3/2Mr3/2T)$ bound of regret while G-ESTS can achineve the $tildeO
arXiv Detail & Related papers (2024-01-14T14:14:19Z)
Best-of-Both-Worlds Linear Contextual Bandits [45.378265414553226]
This study investigates the problem of $K$-armed linear contextual bandits, an instance of the multi-armed bandit problem, under an adversarial corruption. We develop a strategy that is effective in both adversarial environments, with theoretical guarantees. We refer to our strategy as the Best-of-Both-Worlds (BoBW) RealFTRL, due to its theoretical guarantees in both adversarial regimes.
arXiv Detail & Related papers (2023-12-27T09:32:18Z)
Best-of-Both-Worlds Algorithms for Linear Contextual Bandits [11.94312915280916]
We study best-of-both-worlds algorithms for $K$-armed linear contextual bandits. Our algorithms deliver near-optimal regret bounds in both the adversarial and adversarial regimes.
arXiv Detail & Related papers (2023-12-24T08:27:30Z)
On the Interplay Between Misspecification and Sub-optimality Gap in Linear Contextual Bandits [76.2262680277608]
We study linear contextual bandits in the misspecified setting, where the expected reward function can be approximated by a linear function class. We show that our algorithm enjoys the same gap-dependent regret bound $tilde O (d2/Delta)$ as in the well-specified setting up to logarithmic factors.
arXiv Detail & Related papers (2023-03-16T15:24:29Z)
Optimal Extragradient-Based Bilinearly-Coupled Saddle-Point Optimization [116.89941263390769]
We consider the smooth convex-concave bilinearly-coupled saddle-point problem, $min_mathbfxmax_mathbfyF(mathbfx) + H(mathbfx,mathbfy)$, where one has access to first-order oracles for $F$, $G$ as well as the bilinear coupling function $H$. We present a emphaccelerated gradient-extragradient (AG-EG) descent-ascent algorithm that combines extragrad
arXiv Detail & Related papers (2022-06-17T06:10:20Z)
On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function. We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
arXiv Detail & Related papers (2021-12-03T21:42:33Z)
Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits [45.43968161616453]
We study the optimal batch-regret tradeoff for batch linear contextual bandits. We prove its regret guarantee, which features a two-phase expression as the time horizon $T$ grows. We also prove a new matrix inequality concentration with dependence on their dynamic upper bounds.
arXiv Detail & Related papers (2021-10-15T12:32:33Z)
Regret Lower Bound and Optimal Algorithm for High-Dimensional Contextual Linear Bandit [10.604939762790517]
We prove a minimax lower bound, $mathcalObig((log d)fracalpha+12Tfrac1-alpha2+log Tbig)$, for the cumulative regret. Second, we propose a simple and computationally efficient algorithm inspired by the general Upper Confidence Bound (UCB) strategy.
arXiv Detail & Related papers (2021-09-23T19:35:38Z)
Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization [51.23789922123412]
We study online learning with bandit feedback (i.e. learner has access to only zeroth-order oracle) where cost/reward functions admit a "pseudo-1d" structure. We show a lower bound of $min(sqrtdT, T3/4)$ for the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm sbcalg that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively.
arXiv Detail & Related papers (2021-02-15T08:16:51Z)
Thresholded Lasso Bandit [70.17389393497125]
Thresholded Lasso bandit is an algorithm that estimates the vector defining the reward function as well as its sparse support. We establish non-asymptotic regret upper bounds scaling as $mathcalO( log d + sqrtT )$ in general, and as $mathcalO( log d + sqrtT )$ under the so-called margin condition.
arXiv Detail & Related papers (2020-10-22T19:14:37Z)
Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB)
arXiv Detail & Related papers (2020-06-17T22:32:19Z)
$Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $mathcalOleft(fracSAcdot mathrmpolyleft(Hright)Delta_minlogleft(SATright)right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $Delta_min$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)
Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems. We show that if $delta = Oleft(rho/sqrtdim_Eright)$, then one can find the optimal policy using $Oleft(dim_Eright)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.