Related papers: Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards

Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards

URL: http://arxiv.org/abs/2010.12866v2
Date: Wed, 27 Oct 2021 07:10:55 GMT
Title: Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards
Authors: Kyungjae Lee and Hongjun Yang and Sungbin Lim and Songhwai Oh
Abstract summary: We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $nu_p$ for $1pleq2$. We propose a novel robust estimator which does not require $nu_p$ as prior information. We show that an error probability of the proposed estimator decays exponentially fast.
Score: 24.983866845065926
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we consider stochastic multi-armed bandits (MABs) with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_{p}$ for $1<p\leq2$. First, we propose a novel robust estimator which does not require $\nu_{p}$ as prior information, while other existing robust estimators demand prior knowledge about $\nu_{p}$. We show that an error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative density function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds of various perturbations. We also find the optimal hyperparameters for each perturbation, which can achieve the minimax optimal regret bound with respect to total rounds. In simulation, the proposed estimator shows favorable performance compared to existing robust estimators for various $p$ values and, for MAB problems, the proposed perturbation strategy outperforms existing exploration methods.

Related papers

Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
We introduce a novel algorithm for best feasible arm identification that guarantees an exponential decay in the error probability.<n>We validate our algorithm through comprehensive empirical evaluations across various problem instances with different levels of complexity.
arXiv Detail & Related papers (2025-06-03T02:56:26Z)
Minimax Rate-Optimal Algorithms for High-Dimensional Stochastic Linear Bandits [1.2010968598596632]
We study the linear bandit problem with multiple arms over $T$ rounds.<n>We show that Lasso estimators are provably suboptimal in the sequential setting.<n>We propose a three-stage arm selection algorithm that uses thresholded Lasso as the main estimation method.
arXiv Detail & Related papers (2025-05-23T02:20:00Z)
Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback. This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc. Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z)
Regret Minimization and Statistical Inference in Online Decision Making with High-dimensional Covariates [7.21848268647674]
We integrate the $varepsilon$-greedy bandit algorithm for decision-making with a hard thresholding algorithm for estimating sparse bandit parameters. Under a margin condition, our method achieves either $O(T1/2)$ regret or classical $O(T1/2)$-consistent inference.
arXiv Detail & Related papers (2024-11-10T01:47:11Z)
Multivariate root-n-consistent smoothing parameter free matching estimators and estimators of inverse density weighted expectations [51.000851088730684]
We develop novel modifications of nearest-neighbor and matching estimators which converge at the parametric $sqrt n $-rate. We stress that our estimators do not involve nonparametric function estimators and in particular do not rely on sample-size dependent parameters smoothing.
arXiv Detail & Related papers (2024-07-11T13:28:34Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with a environment and seeks to reach certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values. We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
Constrained Pure Exploration Multi-Armed Bandits with a Fixed Budget [4.226118870861363]
We consider a constrained, pure exploration, multi-armed bandit formulation under a fixed budget. We propose an algorithm called textscConstrained-SR based on the Successive Rejects framework. We show that the associated decay rate is nearly optimal relative to an information theoretic lower bound in certain special cases.
arXiv Detail & Related papers (2022-11-27T08:58:16Z)
Tractable and Near-Optimal Adversarial Algorithms for Robust Estimation in Contaminated Gaussian Models [1.609950046042424]
Consider the problem of simultaneous estimation of location and variance matrix under Huber's contaminated Gaussian model. First, we study minimum $f$-divergence estimation at the population level, corresponding to a generative adversarial method with a nonparametric discriminator. We develop tractable adversarial algorithms with simple spline discriminators, which can be implemented via nested optimization. The proposed methods are shown to achieve minimax optimal rates or near-optimal rates depending on the $f$-divergence and the penalty used.
arXiv Detail & Related papers (2021-12-24T02:46:51Z)
Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks. We develop novel momentum methods with a better iteration of $O (1/epsilon4)$ for finding an $epsilon$stationary solution. We also design a novel family of adaptive methods with the same complexity of $O (1/epsilon4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z)
Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes. Agent sequentially selects actions and, from the resulting system trajectory, aims at the best as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
Regret-Optimal Filtering [57.51328978669528]
We consider the problem of filtering in linear state-space models through the lens of regret optimization. We formulate a novel criterion for filter design based on the concept of regret between the estimation error energy of a clairvoyant estimator. We show that the regret-optimal estimator can be easily implemented by solving three Riccati equations and a single Lyapunov equation.
arXiv Detail & Related papers (2021-01-25T19:06:52Z)
Reward Biased Maximum Likelihood Estimation for Reinforcement Learning [13.820705458648233]
Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed. We show that it has a regret of $mathcalO( log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-11-16T06:09:56Z)
A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
We introduce a general analysis framework and a family of algorithms for the linear bandit problem. Our new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG) that reduces the overexploration problem in OFUL. In addition to proving that SG is theoretically rate optimal, our empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
arXiv Detail & Related papers (2020-02-12T18:54:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.