Exploration via linearly perturbed loss minimisation
- URL: http://arxiv.org/abs/2311.07565v2
- Date: Wed, 6 Mar 2024 12:28:17 GMT
- Title: Exploration via linearly perturbed loss minimisation
- Authors: David Janz, Shuai Liu, Alex Ayoub, Csaba Szepesvári
- Abstract summary: We introduce exploration via linear loss perturbations (EVILL) for structured bandit problems.
We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards.
We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.
- Score: 4.856378369489158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce exploration via linear loss perturbations (EVILL), a randomised
exploration method for structured stochastic bandit problems that works by
solving for the minimiser of a linearly perturbed regularised negative
log-likelihood function. We show that, for the case of generalised linear
bandits, EVILL reduces to perturbed history exploration (PHE), a method where
exploration is done by training on randomly perturbed rewards. In doing so, we
provide a simple and clean explanation of when and why random reward
perturbations give rise to good bandit algorithms. We propose data-dependent
perturbations not present in previous PHE-type methods that allow EVILL to
match the performance of Thompson-sampling-style parameter-perturbation
methods, both in theory and in practice. Moreover, we show an example outside
generalised linear bandits where PHE leads to inconsistent estimates, and thus
linear regret, while EVILL remains performant. Like PHE, EVILL can be
implemented in just a few lines of code.
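As a companion to the abstract, here is a minimal Python sketch of the perturbed-loss idea for the linear-Gaussian special case: minimise a regularised least-squares loss with an added random linear term. The function name and the perturbation covariance (a scaled Gram matrix) are illustrative assumptions, not the paper's prescribed data-dependent choice.

```python
import numpy as np

def evill_linear_sketch(X, y, lam=1.0, scale=1.0, rng=None):
    """Hedged sketch of linearly perturbed loss minimisation (linear-Gaussian case).

    Perturbed objective:
        0.5 * ||y - X @ theta||^2 + 0.5 * lam * ||theta||^2 + z @ theta,
    whose minimiser is (X^T X + lam * I)^{-1} (X^T y - z).

    The perturbation covariance below (scale^2 times the regularised Gram
    matrix) is an assumption for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)                              # regularised Gram matrix
    z = rng.multivariate_normal(np.zeros(d), scale**2 * gram)     # assumed random linear perturbation
    return np.linalg.solve(gram, X.T @ y - z)                     # minimiser of the perturbed loss

# Usage sketch: at each round, refit theta on the history observed so far and
# pull the arm a maximising a @ theta over the available arm feature vectors.
```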
Related papers
- PopArt: Efficient Sparse Regression and Experimental Design for Optimal
Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt.
We derive sparse linear bandit algorithms whose regret upper bounds improve upon the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Information Directed Sampling for Sparse Linear Bandits [42.232086950768476]
We develop a class of information-theoretic Bayesian regret bounds that nearly match existing lower bounds on a variety of problem instances.
Numerical results demonstrate significant regret reductions by sparse IDS relative to several baselines.
arXiv Detail & Related papers (2021-05-29T10:26:23Z) - Online Limited Memory Neural-Linear Bandits with Likelihood Matching [53.18698496031658]
We study neural-linear bandits for solving problems where both exploration and representation learning play an important role.
We propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
arXiv Detail & Related papers (2021-02-07T14:19:07Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Reward-Biased Maximum Likelihood Estimation for Linear Stochastic
Bandits [16.042075861624056]
We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods.
The new policies achieve this with low per-pull computation time for linear bandits, yielding both favorable regret and computational efficiency.
arXiv Detail & Related papers (2020-10-08T16:17:53Z) - A Smoothed Analysis of Online Lasso for the Sparse Linear Contextual
Bandit Problem [25.5073029104128]
We prove that the simple online Lasso supports sparse linear contextual bandits with regret bound $\mathcal{O}(\sqrt{kT\log d})$.
Special structures in our results explicitly characterize how the perturbation affects exploration length.
arXiv Detail & Related papers (2020-07-16T18:48:45Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (a brief sketch of this biased regularization appears after the list below).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z) - Multiscale Non-stationary Stochastic Bandits [83.48992319018147]
We propose a novel multiscale changepoint detection method for the non-stationary linear bandit problems, called Multiscale-LinUCB.
Experimental results show that our proposed Multiscale-LinUCB algorithm outperforms other state-of-the-art algorithms in non-stationary contextual environments.
arXiv Detail & Related papers (2020-02-13T00:24:17Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z) - Perturbed-History Exploration in Stochastic Linear Bandits [35.70267786499955]
We propose a new online algorithm for minimising the cumulative regret in a linear bandit.
The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history.
arXiv Detail & Related papers (2019-03-21T17:45:11Z)
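The entry above describes the perturbed-history mechanism directly; the following is a minimal Python sketch of one such step for a linear bandit: fit ridge regression to noise-perturbed past rewards, then act greedily. The additive Gaussian perturbation and its scale `a` are illustrative assumptions; the paper specifies its own perturbation scheme.

```python
import numpy as np

def phe_linear_step(X, y, arms, a=1.0, lam=1.0, rng=None):
    """Hedged sketch of a perturbed-history exploration step for a linear
    bandit.  The Gaussian reward perturbation with scale `a` is an
    assumption made for illustration."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    y_tilde = y + a * rng.standard_normal(len(y))            # perturb the observed rewards
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d),       # ridge fit on the
                            X.T @ y_tilde)                   # perturbed history
    return int(np.argmax(arms @ theta))                      # index of the greedy arm
```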
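Referring back to the meta-learning entry above, here is a hedged sketch of the biased regularised least-squares step that such a strategy builds on: the regulariser penalises distance to a bias vector `h` rather than to the origin. The function name and interface are illustrative assumptions.

```python
import numpy as np

def biased_ridge(X, y, h, lam=1.0):
    """Regularised least squares biased towards a vector h:
        argmin_theta ||y - X @ theta||^2 + lam * ||theta - h||^2
                   = (X^T X + lam * I)^{-1} (X^T y + lam * h)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)
```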
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.