Kernel $ε$-Greedy for Multi-Armed Bandits with Covariates
- URL: http://arxiv.org/abs/2306.17329v2
- Date: Sun, 01 Jun 2025 13:43:39 GMT
- Title: Kernel $ε$-Greedy for Multi-Armed Bandits with Covariates
- Authors: Sakshi Arya, Bharath K. Sriperumbudur
- Abstract summary: We estimate the unknown mean reward functions using an online weighted kernel ridge regression estimator. We show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS.
- Score: 5.115048067424624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the $\epsilon$-greedy strategy for the multi-armed bandits with covariates (MABC) problem, where the mean reward functions are assumed to lie in a reproducing kernel Hilbert space (RKHS). We propose to estimate the unknown mean reward functions using an online weighted kernel ridge regression estimator, and show the resultant estimator to be consistent under appropriate decay rates of the exploration probability sequence, $\{\epsilon_t\}_t$, and regularization parameter, $\{\lambda_t\}_t$. Moreover, we show that for any choice of kernel and the corresponding RKHS, we achieve a sub-linear regret rate depending on the intrinsic dimensionality of the RKHS. Furthermore, we achieve the optimal regret rate of $\sqrt{T}$ under a margin condition for finite-dimensional RKHS.
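As a concrete illustration, below is a minimal Python sketch of the kernel $\epsilon$-greedy loop. It is not the paper's algorithm verbatim: a re-fit batch (unweighted) kernel ridge regression stands in for the online weighted estimator, and the decay schedules for $\{\epsilon_t\}_t$ and $\{\lambda_t\}_t$ are illustrative placeholders rather than the rates analyzed in the paper.
```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KernelEpsGreedy:
    """Kernel eps-greedy sketch: one kernel ridge regressor per arm."""

    def __init__(self, n_arms, eps0=1.0, lam0=1.0):
        self.n_arms, self.eps0, self.lam0 = n_arms, eps0, lam0
        self.X = [[] for _ in range(n_arms)]  # covariates observed per arm
        self.y = [[] for _ in range(n_arms)]  # rewards observed per arm

    def _predict(self, arm, x, t):
        """KRR estimate of the mean reward of `arm` at covariate x."""
        if not self.X[arm]:
            return 0.0
        X, y = np.asarray(self.X[arm]), np.asarray(self.y[arm])
        lam_t = self.lam0 / np.sqrt(t)  # placeholder decay for lambda_t
        K = rbf_kernel(X, X)
        alpha = np.linalg.solve(K + lam_t * len(X) * np.eye(len(X)), y)
        return float((rbf_kernel(x[None, :], X) @ alpha)[0])

    def choose(self, x, t, rng):
        eps_t = min(1.0, self.eps0 / t)  # placeholder decay for epsilon_t
        if rng.random() < eps_t:
            return int(rng.integers(self.n_arms))  # explore uniformly
        return int(np.argmax([self._predict(a, x, t) for a in range(self.n_arms)]))

    def update(self, arm, x, reward):
        self.X[arm].append(np.asarray(x, dtype=float))
        self.y[arm].append(float(reward))

# Toy run: 3 arms, 2-dimensional covariates, synthetic rewards.
rng = np.random.default_rng(0)
agent = KernelEpsGreedy(n_arms=3)
for t in range(1, 201):
    x = rng.normal(size=2)
    arm = agent.choose(x, t, rng)
    reward = np.sin(x[0]) * (arm == 0) + rng.normal(scale=0.1)
    agent.update(arm, x, reward)
```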
Related papers
- Differential Privacy in Kernelized Contextual Bandits via Random Projections [8.658538065693206]
We consider the problem of contextual kernel bandits, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space. We propose a novel algorithm that achieves the state-of-the-art cumulative regret of $\widetilde{\mathcal{O}}\left(\sqrt{\gamma_T T}+\frac{\gamma_T}{\varepsilon_{\mathrm{DP}}}\right)$.
arXiv Detail & Related papers (2025-07-18T03:54:49Z)
- A hierarchical Vovk-Azoury-Warmuth forecaster with discounting for online regression in RKHS [0.0]
We study the problem of online regression with the unconstrained quadratic loss against a time-varying sequence of functions from a Reproducing Kernel Hilbert Space (RKHS). We propose a fully adaptive, hierarchical algorithm, which learns both the discount factor and the number of random features.
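For orientation, here is a minimal sketch of a Vovk-Azoury-Warmuth forecaster with a fixed discount factor; the paper's hierarchical machinery, which learns the discount and the number of random features, is omitted, and the fixed `lam` and `beta` values are assumptions.
```python
import numpy as np

class DiscountedVAW:
    """Vovk-Azoury-Warmuth forecaster with a fixed discount factor beta."""

    def __init__(self, d, lam=1.0, beta=0.99):
        self.G = np.zeros((d, d))  # discounted second-moment matrix of covariates
        self.b = np.zeros(d)       # discounted covariate-target correlations
        self.lam, self.beta, self.d = lam, beta, d

    def predict(self, x):
        # The VAW forecaster includes the *current* covariate in the Gram matrix.
        self.G = self.beta * self.G + np.outer(x, x)
        self.b = self.beta * self.b
        A = self.lam * np.eye(self.d) + self.G
        return float(x @ np.linalg.solve(A, self.b))

    def update(self, x, y):
        # Incorporate the revealed target after predicting.
        self.b += y * x
```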
arXiv Detail & Related papers (2025-06-27T20:47:52Z)
- Differentially Private Kernelized Contextual Bandits [8.658538065693206]
We consider the problem of contextual kernel bandits, where the underlying reward function belongs to a known Reproducing Kernel Hilbert Space (RKHS).
We propose a novel algorithm that improves upon the state of the art and achieves an error rate of $\mathcal{O}\left(\sqrt{\frac{\gamma_T}{T}} + \frac{\gamma_T}{T\varepsilon}\right)$ after $T$ queries.
arXiv Detail & Related papers (2025-01-13T04:05:19Z)
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}\big((\bar{C} - \bar{C}_b)^{-1} H^{2.5} S \sqrt{AK}\big)$, where $\bar{C}$ is the cost budget for an episode.
arXiv Detail & Related papers (2024-10-14T04:51:06Z)
- The Minimax Rate of HSIC Estimation for Translation-Invariant Kernels [0.0]
We prove that the minimax optimal rate of HSIC estimation on $\mathbb{R}^d$ for Borel measures containing the Gaussians with continuous bounded translation-invariant characteristic kernels is $\mathcal{O}\!\left(n^{-1/2}\right)$.
arXiv Detail & Related papers (2024-03-12T15:13:21Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the iteration cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- The $L^\infty$ Learnability of Reproducing Kernel Hilbert Spaces [3.2931415075553576]
We analyze the learnability of reproducing kernel Hilbert spaces (RKHS) under the $L^\infty$ norm.
For dot-product kernels on the sphere, we identify conditions under which $L^\infty$ learning can be achieved with Hilbert samples.
arXiv Detail & Related papers (2023-06-05T12:29:13Z)
- Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z)
- Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise reward.
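Here is a minimal sketch of the least-squares reward-redistribution step described above, under assumed linear per-step features; the pessimistic value-iteration stage of PARTED is omitted.
```python
import numpy as np

def redistribute_rewards(Phi, returns, lam=1e-3):
    """Least-squares reward redistribution (sketch).

    Phi:     (n_traj, H, d) per-step feature vectors (assumed linear features).
    returns: (n_traj,) observed trajectory returns.
    Fits w so that sum_h phi_{i,h}^T w ~ R_i, then emits per-step proxy
    rewards r_hat_{i,h} = phi_{i,h}^T w.
    """
    S = Phi.sum(axis=1)                                   # (n_traj, d)
    d = S.shape[1]
    w = np.linalg.solve(S.T @ S + lam * np.eye(d), S.T @ returns)
    return Phi @ w                                        # (n_traj, H)
```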
arXiv Detail & Related papers (2022-06-13T19:11:22Z)
- Spectral bounds of the $\varepsilon$-entropy of kernel classes [6.028247638616059]
We develop new bounds on the $\varepsilon$-entropy of a unit ball in a reproducing kernel Hilbert space induced by some Mercer kernel $K$.
In our approach, we exploit the ellipsoidal structure of the unit ball in the RKHS and previous work on covering numbers of ellipsoids in Euclidean space.
arXiv Detail & Related papers (2022-04-09T16:45:22Z)
- Meta-Learning Hypothesis Spaces for Sequential Decision-making [79.73213540203389]
We propose to meta-learn a kernel from offline data (Meta-KeL).
Under mild conditions, we guarantee that our estimated RKHS yields valid confidence sets.
We also empirically evaluate the effectiveness of our approach on a Bayesian optimization task.
arXiv Detail & Related papers (2022-02-01T17:46:51Z)
- Misspecified Gaussian Process Bandit Optimization [59.30399661155574]
Kernelized bandit algorithms have shown strong empirical and theoretical performance for this problem.
We introduce a misspecified kernelized bandit setting where the unknown function can be $\epsilon$-uniformly approximated by a function with a bounded norm in some Reproducing Kernel Hilbert Space (RKHS).
We show that our algorithm achieves optimal dependence on $epsilon$ with no prior knowledge of misspecification.
arXiv Detail & Related papers (2021-11-09T09:00:02Z)
- Optimal policy evaluation using kernel-based temporal difference methods [78.83926562536791]
We use reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP).
We derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator.
We prove minimax lower bounds over sub-classes of MRPs.
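For context, below is a generic kernel TD(0) value estimator built from a functional-gradient update; this is a standard textbook construction, not the specific kernel estimator analyzed in the paper, and the kernel, learning rate, and discount are illustrative assumptions.
```python
import numpy as np

def rbf(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

class KernelTD0:
    """Kernel TD(0): V(s) = sum_i alpha_i k(s_i, s), grown one atom per step."""

    def __init__(self, discount=0.95, lr=0.1, kernel=rbf):
        self.centers, self.alphas = [], []
        self.discount, self.lr, self.k = discount, lr, kernel

    def value(self, s):
        return sum(a * self.k(c, s) for c, a in zip(self.centers, self.alphas))

    def update(self, s, r, s_next):
        # Functional-gradient TD(0) step: add a kernel atom at s scaled by the TD error.
        delta = r + self.discount * self.value(s_next) - self.value(s)
        self.centers.append(np.asarray(s, dtype=float))
        self.alphas.append(self.lr * delta)
```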
arXiv Detail & Related papers (2021-09-24T14:48:20Z)
- Stochastic Linear Bandits with Protected Subspace [51.43660657268171]
We study a variant of the linear bandit problem wherein we optimize a linear objective function but rewards are accrued only to an unknown subspace.
In particular, at each round, the learner must choose whether to query the objective or the protected subspace alongside choosing an action.
Our algorithm, derived from the OFUL principle, uses some of the queries to get an estimate of the protected space.
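Since the algorithm is derived from the OFUL principle, a minimal sketch of the vanilla OFUL selection rule follows; the protected-subspace queries and estimation are omitted, and the fixed confidence radius `beta` is a placeholder for the theory-driven, growing radius.
```python
import numpy as np

class OFUL:
    """Vanilla OFUL: optimism over a confidence ellipsoid for theta."""

    def __init__(self, d, lam=1.0, beta=1.0):
        self.V = lam * np.eye(d)  # regularized design matrix
        self.b = np.zeros(d)
        self.beta = beta          # fixed placeholder confidence radius

    def choose(self, actions):
        # UCB index: x^T theta_hat + beta * ||x||_{V^{-1}}
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        ucb = [x @ theta_hat + self.beta * np.sqrt(x @ V_inv @ x) for x in actions]
        return int(np.argmax(ucb))

    def update(self, x, reward):
        self.V += np.outer(x, x)
        self.b += reward * x
```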
arXiv Detail & Related papers (2020-11-02T14:59:39Z)
- Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$.
We propose a novel robust estimator that does not require $\nu_p$ as prior information.
We show that the error probability of the proposed estimator decays exponentially fast.
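The paper's estimator is not spelled out in this summary; as a point of comparison, here is a standard median-of-means estimator, a classical robust mean estimator for heavy-tailed samples that likewise needs no moment bound as input.
```python
import numpy as np

def median_of_means(samples, n_blocks=5, seed=0):
    """Median-of-means: shuffle, split into blocks, average each, take the median."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(samples, dtype=float))
    blocks = np.array_split(shuffled, n_blocks)
    return float(np.median([b.mean() for b in blocks]))
```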
arXiv Detail & Related papers (2020-10-24T10:44:02Z)
- Kernel-Based Reinforcement Learning: A Finite-Time Analysis [53.47210316424326]
We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards.
We empirically validate our approach in continuous MDPs with sparse rewards.
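As a rough illustration of the two ingredients named above, a non-parametric kernel reward estimate plus optimism, here is a hedged sketch; the Gaussian weighting, bonus form, and bandwidth are assumptions, not Kernel-UCBVI's exact construction.
```python
import numpy as np

def kernel_reward_ucb(query, data_sa, data_r, bandwidth=0.5, bonus_scale=1.0):
    """Nadaraya-Watson reward estimate plus an optimism bonus.

    query:   (d,) state-action vector; data_sa: (n, d); data_r: (n,).
    The bonus shrinks with the effective sample count near the query.
    """
    d2 = ((data_sa - query) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    n_eff = w.sum()
    r_hat = (w @ data_r) / max(n_eff, 1e-12)
    return r_hat + bonus_scale / np.sqrt(1.0 + n_eff)
```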
arXiv Detail & Related papers (2020-04-12T12:23:46Z)
- Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm.
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.