SCaLE: Switching Cost aware Learning and Exploration
- URL: http://arxiv.org/abs/2601.09042v1
- Date: Wed, 14 Jan 2026 00:14:58 GMT
- Title: SCaLE: Switching Cost aware Learning and Exploration
- Authors: Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman,
- Abstract summary: This work addresses the fundamental problem of metric movement costs in bandit online convex optimization. For a general class of stochastic environments, we provide the first algorithm, SCaLE, that achieves a distribution-agnostic sub-linear dynamic regret. We present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret.
- Score: 20.483938439210522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm, SCaLE, that provably achieves a distribution-agnostic sub-linear dynamic regret, without knowledge of the hitting cost structure. En route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments against online-learning baselines corroborate our claims and highlight the statistical consistency of our algorithm.
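The setting in the abstract can be sketched numerically. The snippet below is only an illustration under simplifying assumptions: a fixed identity hitting-cost matrix, a random-walk drift for the per-round minimizers, and a noisy-estimate learner standing in for bandit feedback; none of this is the paper's actual SCaLE algorithm or model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 200

# Hypothetical environment: per-round quadratic hitting cost
# f_t(x) = (x - theta_t)^T Q (x - theta_t), plus an l2 switching cost
# between consecutive actions. Q and the drift process are assumptions.
thetas = np.cumsum(0.05 * rng.standard_normal((T, d)), axis=0)  # drifting minimizers
Q = np.eye(d)  # fixed, well-conditioned hitting-cost matrix for simplicity

def hit(x, theta):
    """Quadratic hitting cost at action x when the minimizer is theta."""
    diff = x - theta
    return diff @ Q @ diff

def total_cost(xs):
    """Sum of hitting costs plus l2 movement between consecutive actions."""
    cost = sum(hit(x, th) for x, th in zip(xs, thetas))
    cost += sum(np.linalg.norm(xs[t] - xs[t - 1]) for t in range(1, T))
    return cost

# Toy learner: follow a noisy estimate of theta_t (a stand-in for what an
# actual bandit-feedback algorithm would learn).
xs_alg = [thetas[t] + 0.1 * rng.standard_normal(d) for t in range(T)]
# Dynamic comparator: the per-round minimizers themselves.
xs_opt = [thetas[t] for t in range(T)]

# Dynamic regret: excess cost of the learner over the dynamic comparator.
dynamic_regret = total_cost(xs_alg) - total_cost(xs_opt)
print(dynamic_regret)
```

Here the comparator pays no hitting cost but still pays for its own movement, so the learner's regret comes from both estimation error (extra hitting cost) and extra switching, mirroring the two cost components the abstract trades off.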
Related papers
- Dynamic Regret via Discounted-to-Dynamic Reduction with Applications to Curved Losses and Adam Optimizer [72.0797062226335]
We build dynamic regret minimization methods via a discounted-to-dynamic reduction. We focus on two representative curved losses: linear regression and logistic regression. Our method yields new dynamic regret guarantees for online logistic regression.
arXiv Detail & Related papers (2026-02-09T08:10:53Z) - Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [70.38810219913593]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Dynamic Pricing in the Linear Valuation Model using Shape Constraints [21.319339643047826]
We propose a shape-constrained approach to dynamic pricing for censored data in the linear valuation model. Our method attains lower empirical regret in comparison to several existing methods in the literature.
arXiv Detail & Related papers (2025-02-09T04:58:33Z) - Efficient Methods for Non-stationary Online Learning [63.268670895111654]
We present efficient methods for optimizing dynamic regret and adaptive regret. The proposed algorithms require only one gradient query and one function evaluation at each round. We also study an even stronger measure, namely "interval dynamic regret", and reduce the number of projections per round from $O(\log^2 T)$ to $1$.
arXiv Detail & Related papers (2023-09-16T07:30:12Z) - Rate-Optimal Online Convex Optimization in Adaptive Linear Control [0.0]
We consider the problem of controlling an unknown convex linear system under adversarially changing costs.
We present the first computationally efficient algorithm that attains optimal regret with respect to the best linear controller in hindsight.
arXiv Detail & Related papers (2022-06-03T07:32:11Z) - Online Nonsubmodular Minimization with Delayed Costs: From Full
Information to Bandit Feedback [98.7678704343537]
We focus on a class of nonsubmodular functions with special structure, and prove regret guarantees for several variants of the online and approximate online bandit gradient descent algorithms.
We derive bounds for the agent's regret in the full information and bandit feedback setting, even if the delay between choosing a decision and receiving the incurred cost is unbounded.
arXiv Detail & Related papers (2022-05-15T08:27:12Z) - Smoothed Online Convex Optimization Based on Discounted-Normal-Predictor [68.17855675511602]
We investigate an online prediction strategy named Discounted-Normal-Predictor (Kapralov and Panigrahy, 2010) for smoothed online convex optimization (SOCO).
We show that the proposed algorithm can minimize the adaptive regret with switching cost in every interval.
arXiv Detail & Related papers (2022-05-02T08:48:22Z) - Rate-matching the regret lower-bound in the linear quadratic regulator
with unknown dynamics [6.287145010885044]
This paper establishes a novel regret upper-bound of $O_p(\sqrt{T}\,\mathrm{polylog}(T))$.
It simultaneously establishes an estimation error bound on the dynamics of $O_p(\sqrt{T}\,\mathrm{polylog}(T))$.
arXiv Detail & Related papers (2022-02-11T17:50:14Z) - Reinforcement Learning Policies in Continuous-Time Linear Systems [0.0]
We present online policies that learn optimal actions fast by carefully randomizing the parameter estimates.
We prove sharp stability results for inexact system dynamics and tightly specify the infinitesimal regret caused by sub-optimal actions.
Our analysis sheds light on fundamental challenges in continuous-time reinforcement learning and suggests a useful cornerstone for similar problems.
arXiv Detail & Related papers (2021-09-16T00:08:50Z) - Online Adversarial Attacks [57.448101834579624]
We formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases.
We first rigorously analyze a deterministic variant of the online threat model.
We then propose a simple yet practical algorithm yielding a provably better competitive ratio for $k=2$ over the current best single-threshold algorithm.
arXiv Detail & Related papers (2021-03-02T20:36:04Z) - Online Policy Gradient for Model Free Learning of Linear Quadratic
Regulators with $\sqrt{T}$ Regret [0.0]
We present the first model-free algorithm that achieves similar regret guarantees.
Our method relies on an efficient policy gradient scheme, and a novel and tighter analysis of the cost of exploration in policy space.
arXiv Detail & Related papers (2021-02-25T00:25:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.