Exploration via linearly perturbed loss minimisation
- URL: http://arxiv.org/abs/2311.07565v2
- Date: Wed, 6 Mar 2024 12:28:17 GMT
- Title: Exploration via linearly perturbed loss minimisation
- Authors: David Janz, Shuai Liu, Alex Ayoub, Csaba Szepesvári
- Abstract summary: We introduce exploration via linear loss perturbations (EVILL) for structured bandit problems.
We show that, for the case of generalised linear bandits, EVILL reduces to perturbed history exploration (PHE), a method where exploration is done by training on randomly perturbed rewards.
We propose data-dependent perturbations not present in previous PHE-type methods that allow EVILL to match the performance of Thompson-sampling-style parameter-perturbation methods.
- Score: 4.856378369489158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce exploration via linear loss perturbations (EVILL), a randomised
exploration method for structured stochastic bandit problems that works by
solving for the minimiser of a linearly perturbed regularised negative
log-likelihood function. We show that, for the case of generalised linear
bandits, EVILL reduces to perturbed history exploration (PHE), a method where
exploration is done by training on randomly perturbed rewards. In doing so, we
provide a simple and clean explanation of when and why random reward
perturbations give rise to good bandit algorithms. We propose data-dependent
perturbations not present in previous PHE-type methods that allow EVILL to
match the performance of Thompson-sampling-style parameter-perturbation
methods, both in theory and in practice. Moreover, we show an example outside
generalised linear bandits where PHE leads to inconsistent estimates, and thus
linear regret, while EVILL remains performant. Like PHE, EVILL can be
implemented in just a few lines of code.
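As a companion to the abstract, here is a minimal Python sketch of the perturbed-loss idea for the linear-Gaussian special case: minimise a regularised least-squares loss with an added random linear term. The function name and the perturbation covariance (a scaled Gram matrix) are illustrative assumptions, not the paper's prescribed data-dependent choice.

```python
import numpy as np

def evill_linear_sketch(X, y, lam=1.0, scale=1.0, rng=None):
    """Hedged sketch of linearly perturbed loss minimisation (linear-Gaussian case).

    Perturbed objective:
        0.5 * ||y - X @ theta||^2 + 0.5 * lam * ||theta||^2 + z @ theta,
    whose minimiser is (X^T X + lam * I)^{-1} (X^T y - z).

    The perturbation covariance below (scale^2 times the regularised Gram
    matrix) is an assumption for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)                              # regularised Gram matrix
    z = rng.multivariate_normal(np.zeros(d), scale**2 * gram)     # assumed random linear perturbation
    return np.linalg.solve(gram, X.T @ y - z)                     # minimiser of the perturbed loss

# Usage sketch: at each round, refit theta on the history observed so far and
# pull the arm a maximising a @ theta over the available arm feature vectors.
```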
Related papers
- PopArt: Efficient Sparse Regression and Experimental Design for Optimal
Sparse Linear Bandits [29.097522376094624]
We propose a simple and computationally efficient sparse linear estimation method called PopArt.
We derive sparse linear bandit algorithms whose regret upper bounds improve upon the state of the art.
arXiv Detail & Related papers (2022-10-25T19:13:20Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Information Directed Sampling for Sparse Linear Bandits [42.232086950768476]
We develop a class of information-theoretic Bayesian regret bounds that nearly match existing lower bounds on a variety of problem instances.
Numerical results demonstrate significant regret reductions by sparse IDS relative to several baselines.
arXiv Detail & Related papers (2021-05-29T10:26:23Z) - Online Limited Memory Neural-Linear Bandits with Likelihood Matching [53.18698496031658]
We study neural-linear bandits for solving problems where both exploration and representation learning play an important role.
We propose a likelihood matching algorithm that is resilient to catastrophic forgetting and is completely online.
arXiv Detail & Related papers (2021-02-07T14:19:07Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Reward-Biased Maximum Likelihood Estimation for Linear Stochastic
Bandits [16.042075861624056]
We develop novel index policies that we prove achieve order-optimality, and show that they achieve empirical performance competitive with the state-of-the-art benchmark methods.
The new policies achieve this with low per-pull computation time for linear bandits, yielding both favorable regret and computational efficiency.
arXiv Detail & Related papers (2020-10-08T16:17:53Z) - A Smoothed Analysis of Online Lasso for the Sparse Linear Contextual
Bandit Problem [25.5073029104128]
We prove that the simple online Lasso supports sparse linear contextual bandits with regret bound $\mathcal{O}(\sqrt{kT\log d})$.
Special structures in our results explicitly characterize how the perturbation affects exploration length.
arXiv Detail & Related papers (2020-07-16T18:48:45Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector (a brief sketch of this biased regularization appears after the list below).
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z) - Multiscale Non-stationary Stochastic Bandits [83.48992319018147]
We propose a novel multiscale changepoint detection method for the non-stationary linear bandit problems, called Multiscale-LinUCB.
Experimental results show that our proposed Multiscale-LinUCB algorithm outperforms other state-of-the-art algorithms in non-stationary contextual environments.
arXiv Detail & Related papers (2020-02-13T00:24:17Z) - Thompson Sampling Algorithms for Mean-Variance Bandits [97.43678751629189]
We develop Thompson Sampling-style algorithms for mean-variance MAB.
We also provide comprehensive regret analyses for Gaussian and Bernoulli bandits.
Our algorithms significantly outperform existing LCB-based algorithms for all risk tolerances.
arXiv Detail & Related papers (2020-02-01T15:33:50Z) - Perturbed-History Exploration in Stochastic Linear Bandits [35.70267786499955]
We propose a new online algorithm for minimising the cumulative regret in a linear bandit.
The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history.
arXiv Detail & Related papers (2019-03-21T17:45:11Z)
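The entry above describes the perturbed-history mechanism directly; the following is a minimal Python sketch of one such step for a linear bandit: fit ridge regression to noise-perturbed past rewards, then act greedily. The additive Gaussian perturbation and its scale `a` are illustrative assumptions; the paper specifies its own perturbation scheme.

```python
import numpy as np

def phe_linear_step(X, y, arms, a=1.0, lam=1.0, rng=None):
    """Hedged sketch of a perturbed-history exploration step for a linear
    bandit.  The Gaussian reward perturbation with scale `a` is an
    assumption made for illustration."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    y_tilde = y + a * rng.standard_normal(len(y))            # perturb the observed rewards
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d),       # ridge fit on the
                            X.T @ y_tilde)                   # perturbed history
    return int(np.argmax(arms @ theta))                      # index of the greedy arm
```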
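Referring back to the meta-learning entry above, here is a hedged sketch of the biased regularised least-squares step that such a strategy builds on: the regulariser penalises distance to a bias vector `h` rather than to the origin. The function name and interface are illustrative assumptions.

```python
import numpy as np

def biased_ridge(X, y, h, lam=1.0):
    """Regularised least squares biased towards a vector h:
        argmin_theta ||y - X @ theta||^2 + lam * ||theta - h||^2
                   = (X^T X + lam * I)^{-1} (X^T y + lam * h)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * h)
```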
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.