Stochastic Linear Bandits with Parameter Noise
- URL: http://arxiv.org/abs/2601.23164v1
- Date: Fri, 30 Jan 2026 16:47:42 GMT
- Title: Stochastic Linear Bandits with Parameter Noise
- Authors: Daniel Ezer, Alon Peled-Cohen, Yishay Mansour
- Abstract summary: We show a regret upper bound of $\widetilde{O}(\sqrt{d T \log(K/\delta)\, \sigma^2_{\max}})$ for a horizon $T$ and a general action set of size $K$ in dimension $d$, where $\sigma^2_{\max}$ is the maximal variance of the reward over all actions. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable using a very simple explore-exploit algorithm.
- Score: 36.09200986359924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the stochastic linear bandits with parameter noise model, in which the reward of action $a$ is $a^\top \theta$, where $\theta$ is sampled i.i.d. in each round. We show a regret upper bound of $\widetilde{O}(\sqrt{d T \log(K/\delta)\, \sigma^2_{\max}})$ for a horizon $T$ and a general action set of size $K$ in dimension $d$, where $\sigma^2_{\max}$ is the maximal variance of the reward over all actions. We further provide a lower bound of $\widetilde{\Omega}(d \sqrt{T \sigma^2_{\max}})$, which is tight (up to logarithmic factors) whenever $\log(K) \approx d$. For more specific action sets, $\ell_p$ unit balls with $p \leq 2$ and dual norm $q$, we show that the minimax regret is $\widetilde{\Theta}(\sqrt{d T \sigma^2_q})$, where $\sigma^2_q$ is a variance-dependent quantity that is always at most $4$. This is in contrast to the minimax regret attainable for such sets in the classic additive noise model, where the regret is of order $d \sqrt{T}$. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable using a very simple explore-exploit algorithm.
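The abstract credits the upper bound to "a very simple explore-exploit algorithm" without spelling it out here. The sketch below is a purely illustrative explore-then-commit scheme for this setting: the `pull` interface, the uniform exploration, and the exploration budget `T0` are assumptions of this sketch, not the paper's tuned procedure.

```python
import numpy as np

def explore_then_commit(actions, pull, T, T0=None, rng=None):
    """Generic explore-then-commit sketch for a finite linear action set.

    Parameter-noise model (as in the abstract): each round the environment
    draws theta_t i.i.d. around an unknown mean theta*, and playing action a
    yields reward a @ theta_t. `pull` is a hypothetical callable a -> reward.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, d = actions.shape
    # Illustrative exploration budget; the paper's tuning may differ.
    T0 = min(T, T0 if T0 is not None else int(np.ceil((d * T) ** 0.5)))

    # Explore: sample actions uniformly and record (action, reward) pairs.
    # (In practice one would explore a well-conditioned spanning subset.)
    X = actions[rng.integers(K, size=T0)]
    y = np.array([pull(a) for a in X])

    # Estimate the mean parameter theta* by least squares on the explore data.
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Commit: play the empirically best action for the remaining rounds.
    best = actions[int(np.argmax(actions @ theta_hat))]
    tail = np.array([pull(best) for _ in range(T - T0)])
    return best, np.concatenate([y, tail])
```

The intuition behind such a scheme is the usual explore-exploit trade-off: the exploration phase pays roughly `T0` in regret while the commit phase pays the estimation error of `theta_hat`, and the analysis balances the two; the paper's contribution is showing this balance attains the variance-dependent rate above under parameter noise.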
Related papers
- Prior Diffusiveness and Regret in the Linear-Gaussian Bandit [25.66105507730649]
We prove that Thompson sampling exhibits $\tilde{O}(d \sqrt{T} + d r \sqrt{\mathrm{Tr}(\Sigma_0)})$ Bayesian regret in the linear-Gaussian bandit. We establish these results via a new "elliptical potential" lemma, and also provide a lower bound indicating that the burn-in term is unavoidable.
arXiv Detail & Related papers (2026-01-05T11:30:08Z) - Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs [54.28273395444243]
We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of $\tilde{O}\big(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \,\land\, \mathtt{Var}_{\max}^{\mathrm{c}}}{\Delta_h(s,a)}\big)$.
arXiv Detail & Related papers (2025-06-06T20:33:57Z) - Variance-Dependent Regret Lower Bounds for Contextual Bandits [65.93854043353328]
We establish variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\tilde{O}(d\sqrt{K})$ regret bound to $\tilde{O}(d\sqrt{\sum_{k=1}^{K} \sigma_k^2})$.
arXiv Detail & Related papers (2025-03-15T07:09:36Z) - Linear bandits with polylogarithmic minimax regret [8.97780713904412]
We study a noise model for linear bandits for which the subgaussian noise parameter vanishes linearly as we select actions on the unit sphere closer and closer to the unknown vector.
We introduce an algorithm for this problem that exhibits a minimax regret scaling as $\log^3(T)$ in the time horizon $T$, in stark contrast to the square-root scaling of this regret for typical bandit algorithms.
arXiv Detail & Related papers (2024-02-19T10:56:47Z) - Nearly Minimax Optimal Submodular Maximization with Bandit Feedback [12.28389976959093]
We minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$. In this work, we establish the first minimax lower bound for this setting, which scales as $\tilde{\Omega}(\min_{L \le k}(T^{2/3} + \sqrt{\binom{n}{k-L} T}))$. For a slightly restricted algorithm class, we prove a stronger regret lower bound.
arXiv Detail & Related papers (2023-10-27T20:19:03Z) - Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption, and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Bandit Phase Retrieval [45.12888201134656]
We prove that the minimax cumulative regret in this problem is $\tilde{\Theta}(d\sqrt{n})$.
We also show that the minimax simple regret is $\tilde{\Theta}(d/\sqrt{n})$ and that this is only achievable by an adaptive algorithm.
arXiv Detail & Related papers (2021-06-03T08:04:33Z) - $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning enjoys a $\mathcal{O}\left(\frac{SA \cdot \mathrm{poly}(H)}{\Delta_{\min}} \log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z) - Tight Regret Bounds for Noisy Optimization of a Brownian Motion [118.65407541895851]
We consider the problem of Bayesian optimization of a one-dimensional Brownian motion in which the $T$ adaptively chosen observations are corrupted by Gaussian noise.
We show that the smallest possible expected cumulative regret and the smallest possible expected simple regret scale as $\Omega(\sigma\sqrt{T \cdot \log T})$.
arXiv Detail & Related papers (2020-01-25T14:44:53Z)