Perturbed-History Exploration in Stochastic Linear Bandits
- URL: http://arxiv.org/abs/1903.09132v2
- Date: Mon, 10 Jul 2023 22:24:34 GMT
- Title: Perturbed-History Exploration in Stochastic Linear Bandits
- Authors: Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig
Boutilier
- Abstract summary: We propose a new online algorithm for cumulative regret minimization in a linear bandit.
The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history.
- Score: 35.70267786499955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new online algorithm for cumulative regret minimization in a
stochastic linear bandit. The algorithm pulls the arm with the highest
estimated reward in a linear model trained on its perturbed history. Therefore,
we call it perturbed-history exploration in a linear bandit (LinPHE). The
perturbed history is a mixture of observed rewards and randomly generated
i.i.d. pseudo-rewards. We derive a $\tilde{O}(d \sqrt{n})$ gap-free bound on
the $n$-round regret of LinPHE, where $d$ is the number of features. The key
steps in our analysis are new concentration and anti-concentration bounds on
the weighted sum of Bernoulli random variables. To show the generality of our
design, we generalize LinPHE to a logistic model. We evaluate our algorithms
empirically and show that they are practical.
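The arm-selection rule described above is simple enough to sketch in code. Below is a minimal, illustrative NumPy implementation of one LinPHE-style round, assuming Bernoulli(1/2) pseudo-rewards and a ridge-regression fit; the per-observation pseudo-reward count `a` and the regularizer `lam` are placeholder choices for illustration, not the constants prescribed by the paper's analysis.

```python
import numpy as np

def linphe_choose_arm(X_hist, y_hist, arms, a=1, lam=1.0, rng=None):
    """One round of a LinPHE-style arm choice (illustrative sketch).

    X_hist: (t, d) array of feature vectors of previously pulled arms.
    y_hist: (t,)  array of the corresponding observed rewards in [0, 1].
    arms:   (K, d) array of candidate arm feature vectors.
    a, lam: pseudo-reward count per observation and ridge regularizer;
            both are illustrative choices, not tuned values.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, d = arms.shape
    t = len(y_hist)
    if t == 0:
        return int(rng.integers(K))  # no history yet: pick an arm arbitrarily

    # Perturbed history: mix each observed reward with `a` i.i.d.
    # Bernoulli(1/2) pseudo-rewards that share the same feature vector,
    # i.e. every row of X_hist is effectively repeated (a + 1) times.
    pseudo = rng.binomial(1, 0.5, size=(t, a)).sum(axis=1)
    G = (a + 1) * X_hist.T @ X_hist + lam * np.eye(d)
    theta = np.linalg.solve(G, X_hist.T @ (y_hist + pseudo))

    # Pull the arm with the highest estimated reward under the perturbed fit.
    return int(np.argmax(arms @ theta))
```

Note that exploration here comes entirely from the randomness of the pseudo-rewards: the chosen arm is the plain argmax of the estimated rewards, with no optimism bonus added.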
Related papers
- Minimum Empirical Divergence for Sub-Gaussian Linear Bandits [10.750348548547704]
LinMED is a randomized algorithm that admits a closed-form computation of the arm sampling probabilities.
Our empirical study shows that LinMED performs competitively with state-of-the-art algorithms.
arXiv Detail & Related papers (2024-10-31T21:54:44Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Exploration via linearly perturbed loss minimisation [4.856378369489158]
We introduce exploration via linear loss perturbations (EVILL) for structured bandit problems.
We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards.
We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.
arXiv Detail & Related papers (2023-11-13T18:54:43Z)
- Optimal Online Generalized Linear Regression with Stochastic Noise and Its Application to Heteroscedastic Bandits [88.6139446295537]
We study the problem of online generalized linear regression in the setting of a generalized linear model with possibly unbounded additive noise.
We provide a sharp analysis of the classical follow-the-regularized-leader (FTRL) algorithm to cope with the label noise.
We propose an algorithm based on FTRL to achieve the first variance-aware regret bound.
arXiv Detail & Related papers (2022-02-28T08:25:26Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z)
- Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^{2} \Delta^{1/4} K^{3/4})$, where $d$ is the dimension of the feature space.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)
- Randomized Exploration in Generalized Linear Bandits [56.05007606177762]
We study two randomized algorithms for generalized linear bandits.
The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution.
The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards; see the sketch after this list.
arXiv Detail & Related papers (2019-06-21T04:57:54Z)
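As a companion to the logistic extension of LinPHE mentioned in the abstract and to GLM-FPL above, the following rough sketch shows the perturbed-history idea for a logistic model, assuming additive Gaussian perturbations of the past rewards and a regularized fit via a few Newton steps; the noise scale `noise_sd`, regularizer `lam`, and iteration count `n_newton` are illustrative assumptions rather than the parameters specified by either paper.

```python
import numpy as np

def sigmoid(z):
    # Plain logistic link; adequate for a sketch.
    return 1.0 / (1.0 + np.exp(-z))

def glm_fpl_choose_arm(X_hist, y_hist, arms, noise_sd=0.5, lam=1.0,
                       n_newton=10, rng=None):
    """Pick an arm by fitting a logistic GLM to a perturbed reward history.

    X_hist: (t, d) features of previously pulled arms.
    y_hist: (t,)  observed rewards (e.g. binary).
    arms:   (K, d) candidate arm features.
    """
    rng = np.random.default_rng() if rng is None else rng
    K, d = arms.shape
    if len(y_hist) == 0:
        return int(rng.integers(K))  # no data yet: pick an arm arbitrarily

    # Perturb the history: add i.i.d. Gaussian noise to every past reward.
    y_pert = y_hist + rng.normal(0.0, noise_sd, size=len(y_hist))

    # Regularized logistic fit by a few Newton steps on the perturbed data.
    theta = np.zeros(d)
    for _ in range(n_newton):
        p = sigmoid(X_hist @ theta)
        grad = X_hist.T @ (p - y_pert) + lam * theta
        hess = X_hist.T @ (X_hist * (p * (1.0 - p))[:, None]) + lam * np.eye(d)
        theta -= np.linalg.solve(hess, grad)

    # Exploit the fitted model: pull the arm with the highest estimated mean.
    return int(np.argmax(sigmoid(arms @ theta)))
```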