On Limited-Memory Subsampling Strategies for Bandits
- URL: http://arxiv.org/abs/2106.10935v1
- Date: Mon, 21 Jun 2021 09:11:22 GMT
- Title: On Limited-Memory Subsampling Strategies for Bandits
- Authors: Dorian Baudry (Inria, CRIStAL, CNRS), Yoan Russac (DI-ENS, CNRS,
VALDA), Olivier Cappé (DI-ENS, CNRS, VALDA)
- Abstract summary: We show that a simple deterministic subsampling rule, proposed in the recent work of Baudry et al. (2020), is asymptotically optimal in one-parameter exponential families.
We also prove that these guarantees hold when limiting the algorithm's memory to a polylogarithmic function of the time horizon.
We propose a variant of the algorithm in which only the most recent observations are used for subsampling, achieving optimal regret guarantees under the assumption of a known number of abrupt changes.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a recent surge of interest in nonparametric bandit algorithms
based on subsampling. One drawback of these approaches, however, is the
additional complexity required by random subsampling and the storage of the
full history of rewards. Our first contribution is to show that a simple
deterministic subsampling rule, proposed in the recent work of Baudry et al.
(2020) under the name of "last-block subsampling", is asymptotically optimal
in one-parameter exponential families. In addition, we prove that these
guarantees also hold when limiting the algorithm memory to a polylogarithmic
function of the time horizon. These findings open up new perspectives, in
particular for non-stationary scenarios in which the arm distributions evolve
over time. We propose a variant of the algorithm in which only the most recent
observations are used for subsampling, achieving optimal regret guarantees
under the assumption of a known number of abrupt changes. Extensive numerical
simulations highlight the merits of this approach, particularly when the
changes affect more than just the means of the rewards.
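As a rough illustration of the rule described in the abstract (a minimal sketch under simplifying assumptions: every arm has at least one observation, ties are ignored, and the forced-exploration safeguards of the actual algorithm are omitted), one round of the last-block duel could look like:

    import numpy as np

    def last_block_round(histories):
        """One round of a last-block subsampling duel (illustrative sketch).

        histories: list of 1-D arrays; histories[k] holds the rewards observed
        so far for arm k. The limited-memory variant would keep only the most
        recent polylog(T) observations per arm; the non-stationary variant
        would further restrict duels to the most recent observations.
        """
        counts = [len(h) for h in histories]
        leader = int(np.argmax(counts))        # arm with the most observations
        winners = []
        for k, h in enumerate(histories):
            if k == leader:
                continue
            n_k = len(h)
            # Deterministic subsample: the leader's *last* n_k observations.
            leader_block = histories[leader][-n_k:]
            # The challenger wins its duel if its empirical mean matches or
            # beats the leader's mean on that last block.
            if np.mean(h) >= np.mean(leader_block):
                winners.append(k)
        # Play every winning challenger; if none wins, play the leader again.
        return winners if winners else [leader]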
Related papers
- Near-Optimal Algorithm for Non-Stationary Kernelized Bandits [6.379833644595456]
We study a non-stationary kernelized bandit (KB) problem, also called time-varying Bayesian optimization.
We show the first algorithm-independent regret lower bound for non-stationary KB with squared exponential and Matérn kernels.
We propose a novel near-optimal algorithm called restarting phased elimination with random permutation.
arXiv Detail & Related papers (2024-10-21T14:28:26Z)
- Variance-Dependent Regret Bounds for Non-stationary Linear Bandits [52.872628573907434]
We propose algorithms that utilize the variance of the reward distribution as well as $B_K$, and show that they can achieve tighter regret upper bounds.
We introduce two novel algorithms: Restarted Weighted$\text{OFUL}^+$ and Restarted $\text{SAVE}^+$.
Notably, when the total variance $V_K$ is much smaller than $K$, our algorithms outperform previous state-of-the-art results on non-stationary linear bandits under different settings.
arXiv Detail & Related papers (2024-03-15T23:36:55Z)
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
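Read as an interaction protocol (a rough sketch; the function names and the fixed abstention payoff are illustrative assumptions, not definitions from the paper), one round proceeds as follows:

    def abstention_round(select_arm, decide_abstain, draw_reward, abstention_payoff=0.0):
        """One round of a multi-armed bandit with abstention (illustrative sketch)."""
        arm = select_arm()                 # commit to an arm as in the standard setting
        if decide_abstain(arm):            # refuse the instantaneous reward *before* seeing it
            return arm, abstention_payoff  # collect the (assumed fixed) abstention payoff
        reward = draw_reward(arm)          # otherwise the reward is observed and collected
        return arm, reward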
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
- Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret [34.44347218903429]
The multi-armed bandit setting has been recently studied in the non-stationary regime.
The mean payoff of each action is a non-decreasing function of the number of rounds that have passed since it was last played.
We show how our algorithm can be transformed into a bandit algorithm with sublinear regret.
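In symbols (notation assumed here for illustration), writing $\tau_a(t)$ for the number of rounds since arm $a$ was last played, the recharging model states that

    \mathbb{E}\left[ r_t \mid a_t = a \right] = \mu_a\bigl(\tau_a(t)\bigr),
    \qquad \mu_a \ \text{non-decreasing}.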
arXiv Detail & Related papers (2022-05-29T23:55:36Z)
- Damped Online Newton Step for Portfolio Selection [96.0297968061824]
We revisit the classic online portfolio selection problem, where at each round a learner selects a distribution over a set of portfolios to allocate its wealth.
Existing algorithms that achieve a logarithmic regret for this problem have per-round time and space complexities that scale with the total number of rounds.
We present the first practical online portfolio selection algorithm with a logarithmic regret and whose per-round time and space complexities depend only logarithmically on the horizon.
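For concreteness, in the standard formulation of online portfolio selection (notation assumed here, not taken from the abstract), the learner picks $x_t$ in the simplex $\Delta_d$, the market reveals price relatives $r_t$, and the logarithmic regret is measured against the best fixed portfolio in hindsight:

    R_T = \max_{x \in \Delta_d} \sum_{t=1}^{T} \log \langle x, r_t \rangle
          - \sum_{t=1}^{T} \log \langle x_t, r_t \rangle .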
arXiv Detail & Related papers (2022-02-15T17:01:55Z)
- A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit [1.2183405753834562]
This work addresses a version of the two-armed Bernoulli bandit problem where the sum of the means of the arms is one.
We obtain the leading order terms of the minmax optimal regret and pseudoregret for this problem by associating each of them with a solution of a linear heat equation.
arXiv Detail & Related papers (2022-02-11T17:03:18Z)
- Filtered Poisson Process Bandit on a Continuum [1.4180331276028662]
We consider a version of the continuum armed bandit where an action induces a filtered realisation of a non-homogeneous Poisson process.
We propose an upper confidence bound algorithm for this problem utilising data-adaptive discretisation of the action space.
arXiv Detail & Related papers (2020-07-20T09:39:17Z)
- Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best-fitting ReLU with respect to square loss, given access to draws from some unknown distribution.
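Concretely, writing $\mathcal{D}$ for the unknown distribution over pairs $(x, y)$ (notation assumed here), the square-loss objective is

    \min_{w} \ \mathbb{E}_{(x, y) \sim \mathcal{D}}
    \left[ \bigl( \max(0, \langle w, x \rangle) - y \bigr)^2 \right].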
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
- Efficient improper learning for online logistic regression [68.8204255655161]
It is known that any proper algorithm which has logarithmic regret in the number of samples (denoted $n$) necessarily suffers an exponential multiplicative constant in $B$.
In this work, we design an efficient improper algorithm that avoids this exponential constant while preserving a logarithmic regret.
Our new algorithm based on regularized empirical risk minimization with surrogate losses satisfies a regret scaling as $O(B \log(Bn))$ with a per-round time complexity of order $O(d^2)$.
arXiv Detail & Related papers (2020-03-18T09:16:14Z)
- A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem.
Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL.
In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)
- Frequentist Regret Bounds for Randomized Least-Squares Value Iteration [94.47472987987805]
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL).
In this paper, we introduce an optimistically-perturbed variant of the popular randomized least-squares value iteration (RLSVI).
Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde{O}(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of time steps.
arXiv Detail & Related papers (2019-11-01T19:48:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.