Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic
Optimality
- URL: http://arxiv.org/abs/2006.08754v2
- Date: Fri, 23 Oct 2020 01:51:16 GMT
- Title: Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic
Optimality
- Authors: Kwang-Sung Jun, Chicheng Zhang
- Abstract summary: We study stochastic structured bandits for minimizing regret.
In this paper, we focus on the finite-hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible.
- Score: 25.627939043735683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study stochastic structured bandits for minimizing regret. The fact that
the popular optimistic algorithms do not achieve the asymptotic
instance-dependent regret optimality (asymptotic optimality for short) has
recently eluded researchers. On the other hand, it is known that one can
achieve bounded regret (i.e., does not grow indefinitely with $n$) in certain
instances. Unfortunately, existing asymptotically optimal algorithms rely on
forced sampling that introduces an $\omega(1)$ term w.r.t. the time horizon $n$
in their regret, failing to adapt to the "easiness" of the instance. In this
paper, we focus on the finite hypothesis case and ask if one can achieve the
asymptotic optimality while enjoying bounded regret whenever possible. We
provide a positive answer by introducing a new algorithm called CRush Optimism
with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the
informative arms indicated by a pessimistic hypothesis. Our finite-time
analysis shows that CROP $(i)$ achieves a constant-factor asymptotic optimality
and, thanks to the forced-exploration-free design, $(ii)$ adapts to bounded
regret, and $(iii)$ its regret bound scales not with $K$ but with an effective
number of arms $K_\psi$ that we introduce. We also discuss a problem class
where CROP can be exponentially better than existing algorithms in
\textit{nonasymptotic} regimes. This problem class also reveals a surprising
fact that even a clairvoyant oracle who plays according to the asymptotically
optimal arm pull scheme may suffer a linear worst-case regret.
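The abstract's central idea (keep a finite set of candidate reward models, discard candidates the data rule out, and pull arms that are informative under a pessimistic surviving hypothesis rather than playing the optimistic one) can be illustrated with a small toy sketch. Everything below is an assumed illustration: the name toy_crop, the crude confidence test, and the disagreement-based arm choice are stand-ins, not the paper's actual CROP procedure or its informative-arm optimization.

```python
import numpy as np

def toy_crop(hypotheses, pull, n_rounds, conf_width=2.0):
    """Toy finite-hypothesis elimination sketch (NOT the paper's CROP).

    hypotheses: (H, K) array, each row a candidate mean-reward vector.
    pull:       callable arm_index -> noisy reward from the environment.
    """
    H, K = hypotheses.shape
    counts, sums = np.zeros(K), np.zeros(K)
    alive = np.ones(H, dtype=bool)
    for _ in range(n_rounds):
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        # Drop hypotheses whose predicted means are far from the empirical
        # means on arms that have been sampled (a crude confidence test).
        radius = conf_width / np.sqrt(np.maximum(counts, 1.0))
        for h in np.flatnonzero(alive):
            if np.any((counts > 0) & (np.abs(hypotheses[h] - means) > radius)):
                alive[h] = False
        if not alive.any():              # safety net for the toy example
            alive[:] = True
        live = hypotheses[alive]
        best_vals = live.max(axis=1)
        optimist = live[best_vals.argmax()]   # most favorable survivor
        pessimist = live[best_vals.argmin()]  # least favorable survivor
        if np.allclose(optimist, pessimist):
            arm = int(optimist.argmax())      # survivors agree: exploit
        else:
            # Pull the arm where the two survivors disagree the most;
            # a stand-in for the paper's "informative arm" selection.
            arm = int(np.abs(optimist - pessimist).argmax())
        reward = pull(arm)
        counts[arm] += 1.0
        sums[arm] += reward
    return counts
```

A quick test could use, for instance, pull = lambda a: rng.normal(true_means[a], 1.0) with true_means set to one row of hypotheses; the elimination step then narrows the surviving set to candidates consistent with that row.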
Related papers
- Optimal Batched Linear Bandits [9.795531604467406]
We introduce the E$4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Exploit framework.
With a proper choice of exploration rate, we prove E$4$ achieves finite-time minimax optimal regret with only $O(\log\log T)$ batches.
We show that with another choice of exploration rate, E$4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches.
arXiv Detail & Related papers (2024-06-06T14:57:52Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Instance-Optimality in Interactive Decision Making: Toward a
Non-Asymptotic Theory [30.061707627742766]
We aim for instance-optimality, a strong notion of adaptivity which asserts that, on any particular problem instance, the algorithm under consideration outperforms all consistent algorithms.
In this paper, we take the first step toward developing a non-asymptotic theory of instance-optimal decision making with general function approximation.
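For context, the asymptotic instance-dependent benchmark that both this paper and the CROP abstract above refer to is commonly stated as a Graves-Lai-type lower bound: for every consistent algorithm and every instance $\theta$,
$$\liminf_{n\to\infty}\frac{\mathbb{E}[\mathrm{Regret}(n)]}{\log n}\;\ge\;c^*(\theta),$$
where $c^*(\theta)$ is the value of an instance-dependent optimization over arm-pull allocations, and an algorithm is asymptotically optimal when it matches $c^*(\theta)$ on every instance. This is standard background phrasing and may differ in detail from the definitions used in either paper.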
arXiv Detail & Related papers (2023-04-24T21:51:58Z) - Improved High-Probability Regret for Adversarial Bandits with
Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}\big((\sum_{t=1}^{T}\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t\big)$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z) - Stochastic Online Learning with Feedback Graphs: Finite-Time and
Asymptotic Optimality [39.2230418563572]
We show that, surprisingly, the notion of optimal finite-time regret is not a uniquely defined property in this context.
We give an algorithm that admits quasi-optimal regret both in finite time and asymptotically.
arXiv Detail & Related papers (2022-06-20T22:11:08Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate stationary point within a polynomial number of gradient ascent steps.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
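One natural way to formalize this criterion (an illustrative formalization, not necessarily the paper's exact definition) is to count only gaps larger than $\epsilon$:
$$\mathrm{LenientRegret}_\epsilon(n)\;=\;\mathbb{E}\Big[\sum_{t=1}^{n}\Delta_{a_t}\,\mathbf{1}\{\Delta_{a_t}>\epsilon\}\Big],$$
where $\Delta_a=\mu^\star-\mu_a$ is the suboptimality gap of arm $a$ and $a_t$ is the arm pulled at round $t$; pulls of arms within $\epsilon$ of the best arm contribute nothing.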
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - Predictive Power of Nearest Neighbors Algorithm under Random
Perturbation [21.79888306754263]
We consider a data corruption scenario in the classical $k$ Nearest Neighbors ($k$-NN) algorithm.
Under such a scenario, the impact of corruption level on the regret is carefully characterized.
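The corruption scenario is easy to emulate numerically. The sketch below is an assumed toy setup, not the paper's experimental protocol: a plain $k$-NN classifier is evaluated on test queries whose features are perturbed by Gaussian noise of increasing magnitude.

```python
import numpy as np

# Illustrative experiment (assumptions, not the paper's setup): run plain k-NN
# while the *query* features are corrupted by random noise, and observe how
# accuracy degrades as the corruption level grows.

def knn_predict(X_train, y_train, x_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # simple linear concept
X_test = rng.normal(size=(200, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

for sigma in [0.0, 0.5, 1.0]:                     # corruption level
    X_corrupt = X_test + rng.normal(scale=sigma, size=X_test.shape)
    acc = np.mean([knn_predict(X, y, x) == t for x, t in zip(X_corrupt, y_test)])
    print(f"corruption sigma={sigma:.1f}  accuracy={acc:.3f}")
```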
arXiv Detail & Related papers (2020-02-13T01:35:35Z)