Stability-penalty-adaptive follow-the-regularized-leader: Sparsity,
game-dependency, and best-of-both-worlds
- URL: http://arxiv.org/abs/2305.17301v2
- Date: Tue, 13 Feb 2024 09:23:48 GMT
- Title: Stability-penalty-adaptive follow-the-regularized-leader: Sparsity,
game-dependency, and best-of-both-worlds
- Authors: Taira Tsuchiya, Shinji Ito, Junya Honda
- Abstract summary: Follow-the-regularized-leader (FTRL) has recently emerged as one of the most promising approaches for obtaining various types of adaptivity in bandit problems.
We establish several algorithms with three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds (BOBW)
- Score: 46.30750729936261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptivity to the difficulties of a problem is a key property in sequential
decision-making problems to broaden the applicability of algorithms.
Follow-the-regularized-leader (FTRL) has recently emerged as one of the most
promising approaches for obtaining various types of adaptivity in bandit
problems. Aiming to further generalize this adaptivity, we develop a generic
adaptive learning rate, called the stability-penalty-adaptive (SPA) learning
rate, for FTRL. This learning rate yields a regret bound that jointly depends
on the stability and penalty of the algorithm, into which the regret of FTRL is
typically decomposed. With this result, we establish several algorithms with
three types of adaptivity: sparsity, game-dependency, and best-of-both-worlds
(BOBW). Although sparsity appears frequently in real problems, existing sparse
multi-armed bandit algorithms with $k$ arms assume that the
sparsity level $s \leq k$ is known in advance, which is often not the case in
real-world scenarios. To address this issue, we first establish $s$-agnostic
algorithms with regret bounds of $\tilde{O}(\sqrt{sT})$ in the adversarial
regime for $T$ rounds, which matches the existing lower bound up to a
logarithmic factor. Meanwhile, BOBW algorithms aim to achieve a near-optimal
regret in both the stochastic and adversarial regimes. Leveraging the SPA
learning rate and the technique for $s$-agnostic algorithms combined with a new
analysis to bound the variation in FTRL output in response to changes in a
regularizer, we establish the first BOBW algorithm with a sparsity-dependent
bound. Additionally, we explore partial monitoring and demonstrate that the
proposed SPA learning rate framework allows us to achieve a game-dependent
bound and the BOBW simultaneously.
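To make the stability-penalty language concrete, the regret of FTRL with a time-varying learning rate $\eta_t$ is commonly bounded by a penalty term plus a stability term; schematically (with generic symbols $h_t$ for the penalty and $z_t$ for the stability, not the paper's exact statement),
$$ R_T \;\lesssim\; \sum_{t=1}^{T} \Big( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Big) h_{t+1} \;+\; \sum_{t=1}^{T} \eta_t z_t , $$
and a stability-penalty-adaptive learning rate tunes $\eta_t$ online so that the right-hand side scales roughly as $\sqrt{\sum_t z_t h_{t+1}}$ up to logarithmic factors.
As a minimal illustration (not the paper's SPA-FTRL), the sketch below runs an Exp3-style FTRL with the Shannon-entropy regularizer on a $k$-armed adversarial bandit, shrinking the learning rate as accumulated stability proxies grow; the function name, the stability proxy, and the schedule are assumptions made for this example.
```python
import numpy as np

def ftrl_entropy_bandit(loss_matrix, seed=None):
    """Exp3-style FTRL sketch with a stability-driven adaptive learning rate.

    loss_matrix: (T, k) array of losses in [0, 1] chosen by the environment.
    Returns the arms played and the learner's cumulative loss.
    """
    rng = np.random.default_rng(seed)
    T, k = loss_matrix.shape
    cum_est_loss = np.zeros(k)   # cumulative importance-weighted loss estimates
    cum_stability = 0.0          # running sum of per-round stability proxies
    arms, total_loss = [], 0.0

    for t in range(T):
        # Illustrative adaptive rate: shrink as accumulated stability grows.
        # (The paper's SPA rate also incorporates the penalty; this one does not.)
        eta = np.sqrt(np.log(k) / (1.0 + cum_stability))

        # FTRL with the negative-entropy regularizer has the closed form
        # p_t proportional to exp(-eta * cumulative estimated loss).
        logits = -eta * cum_est_loss
        logits -= logits.max()   # for numerical stability
        p = np.exp(logits)
        p /= p.sum()

        arm = int(rng.choice(k, p=p))
        loss = float(loss_matrix[t, arm])
        arms.append(arm)
        total_loss += loss

        # Importance-weighted loss estimate for the played arm only.
        cum_est_loss[arm] += loss / p[arm]
        # Stability proxy for the entropy regularizer: sum_a p[a] * est[a]^2,
        # which equals loss^2 / p[arm] for importance-weighted estimates.
        cum_stability += loss ** 2 / p[arm]

    return np.array(arms), total_loss

# Example usage on random adversarial losses:
# arms, cum_loss = ftrl_entropy_bandit(np.random.default_rng(0).random((1000, 5)), seed=1)
```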
Related papers
- uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs [33.262918224598614]
We present a novel algorithm for the Heavy-Tailed Multi-Armed Bandits (HTMAB) problem, demonstrating robustness and adaptability.
Our algorithm uniINF enjoys the so-called Best-of-Both-Worlds (BoBW) property, performing optimally in both stochastic and adversarial environments.
To our knowledge, uniINF is the first parameter-free algorithm to achieve the BoBW property for the heavy-tailed MAB problem.
arXiv Detail & Related papers (2024-10-04T09:55:44Z)
- Optimism in the Face of Ambiguity Principle for Multi-Armed Bandits [6.7310264583128445]
Follow-The-Regularized-Leader (FTRL) algorithms often enjoy optimal regret for adversarial as well as stochastic bandit problems.
We propose a new Follow-The-Perturbed-Leader (FTPL) algorithm that generates optimal policies for both adversarial and stochastic multi-armed bandits.
arXiv Detail & Related papers (2024-09-30T16:00:23Z)
- Indexed Minimum Empirical Divergence-Based Algorithms for Linear Bandits [55.938644481736446]
Indexed Minimum Empirical Divergence (IMED) is a highly effective approach to the multi-armed bandit problem.
It has been observed to empirically outperform UCB-based algorithms and Thompson Sampling.
We present novel linear versions of the IMED algorithm, which we call the family of LinIMED algorithms.
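For context on the index that LinIMED builds on, below is a minimal sketch of the vanilla IMED decision rule specialized to Bernoulli rewards; the function names and the Bernoulli specialization are illustrative assumptions, and the LinIMED algorithms in the paper adapt this kind of index to the linear-bandit setting.
```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def imed_choose(counts, means):
    """IMED rule: play the arm minimizing N_a * KL(mu_a, mu_max) + log(N_a).

    counts: number of pulls of each arm (all > 0, e.g. after one initial
    pull per arm); means: empirical mean reward of each arm.
    """
    mu_max = means.max()
    index = counts * bernoulli_kl(means, mu_max) + np.log(counts)
    return int(np.argmin(index))
```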
arXiv Detail & Related papers (2024-05-24T04:11:58Z)
- A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits [0.0]
In areas of high volatility like healthcare or finance, a naive reward approach often does not accurately capture the complexity of the learning problem.
We propose a framework of adaptive risk-aware strategies that operate in non-stationary environments.
arXiv Detail & Related papers (2023-10-24T19:29:13Z)
- Improved Best-of-Both-Worlds Guarantees for Multi-Armed Bandits: FTRL with General Regularizers and Multiple Optimal Arms [41.06668954462585]
We study the problem of designing adaptive multi-armed bandit algorithms that perform optimally in both the stochastic setting and the adversarial setting simultaneously.
We show that uniqueness of the optimal arm is unnecessary for FTRL with a broad family of regularizers and a new learning rate schedule.
arXiv Detail & Related papers (2023-02-27T06:09:10Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K} \sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- A Blackbox Approach to Best of Both Worlds in Bandits and Beyond [33.13041034490332]
Best-of-both-worlds algorithms for online learning achieve near-optimal regret in both the stochastic and the adversarial regimes.
We present a general reduction from best of both worlds to a wide family of follow-the-regularized-leader (FTRL) and online mirror descent (OMD) algorithms.
arXiv Detail & Related papers (2023-02-20T03:42:31Z)
- Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
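The double-sampling issue mentioned above comes from the squared expectation inside the variance; a schematic version of the Fenchel-dual rewriting (generic, not the paper's exact objective over stationary distribution corrections) is
$$ (\mathbb{E}[X])^2 = \max_{\nu \in \mathbb{R}} \big( 2\nu\,\mathbb{E}[X] - \nu^2 \big), \qquad \operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \min_{\nu \in \mathbb{R}} \mathbb{E}\big[(X - \nu)^2\big], $$
so the gradient of a variance penalty can be estimated from single samples by optimizing jointly over an auxiliary scalar $\nu$, instead of multiplying two independent sample averages.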
arXiv Detail & Related papers (2022-12-29T18:25:01Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
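To illustrate the two SUNRISE ingredients summarized above, the sketch below shows ensemble-UCB action selection and an uncertainty-based weight for Bellman targets; the exact weighting function and parameter names are assumptions for illustration, not necessarily the paper's formulas.
```python
import numpy as np

def ucb_action(q_ensemble, lam=1.0):
    """Pick the action with the highest ensemble upper-confidence bound.

    q_ensemble: (n_members, n_actions) Q-value estimates for the current state.
    """
    mean_q = q_ensemble.mean(axis=0)
    std_q = q_ensemble.std(axis=0)
    return int(np.argmax(mean_q + lam * std_q))

def bellman_weight(target_q_ensemble, temperature=10.0):
    """Down-weight Bellman targets with high ensemble disagreement.

    target_q_ensemble: (n_members, batch) target Q-values. Returns weights in
    (0.5, 1], computed as sigmoid(-temperature * std) + 0.5 (one simple
    monotone choice).
    """
    std_q = target_q_ensemble.std(axis=0)
    s = np.exp(-temperature * std_q)          # numerically safe sigmoid(-x)
    return s / (1.0 + s) + 0.5
```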