Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic
Optimality
- URL: http://arxiv.org/abs/2006.08754v2
- Date: Fri, 23 Oct 2020 01:51:16 GMT
- Title: Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic
Optimality
- Authors: Kwang-Sung Jun, Chicheng Zhang
- Abstract summary: We study stochastic structured bandits for minimizing regret.
In this paper, we focus on the finite-hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret whenever possible.
- Score: 25.627939043735683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study stochastic structured bandits for minimizing regret. The fact that
the popular optimistic algorithms do not achieve the asymptotic
instance-dependent regret optimality (asymptotic optimality for short) has
recently eluded researchers. On the other hand, it is known that one can
achieve bounded regret (i.e., does not grow indefinitely with $n$) in certain
instances. Unfortunately, existing asymptotically optimal algorithms rely on
forced sampling that introduces an $\omega(1)$ term w.r.t. the time horizon $n$
in their regret, failing to adapt to the "easiness" of the instance. In this
paper, we focus on the finite hypothesis case and ask if one can achieve the
asymptotic optimality while enjoying bounded regret whenever possible. We
provide a positive answer by introducing a new algorithm called CRush Optimism
with Pessimism (CROP) that eliminates optimistic hypotheses by pulling the
informative arms indicated by a pessimistic hypothesis. Our finite-time
analysis shows that CROP $(i)$ achieves a constant-factor asymptotic optimality
and, thanks to the forced-exploration-free design, $(ii)$ adapts to bounded
regret, and $(iii)$ its regret bound scales not with $K$ but with an effective
number of arms $K_\psi$ that we introduce. We also discuss a problem class
where CROP can be exponentially better than existing algorithms in
\textit{nonasymptotic} regimes. This problem class also reveals a surprising
fact that even a clairvoyant oracle who plays according to the asymptotically
optimal arm pull scheme may suffer a linear worst-case regret.
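The abstract's central idea (keep a finite set of candidate reward models, discard candidates the data rule out, and pull arms that are informative under a pessimistic surviving hypothesis rather than playing the optimistic one) can be illustrated with a small toy sketch. Everything below is an assumed illustration: the name toy_crop, the crude confidence test, and the disagreement-based arm choice are stand-ins, not the paper's actual CROP procedure or its informative-arm optimization.

```python
import numpy as np

def toy_crop(hypotheses, pull, n_rounds, conf_width=2.0):
    """Toy finite-hypothesis elimination sketch (NOT the paper's CROP).

    hypotheses: (H, K) array, each row a candidate mean-reward vector.
    pull:       callable arm_index -> noisy reward from the environment.
    """
    H, K = hypotheses.shape
    counts, sums = np.zeros(K), np.zeros(K)
    alive = np.ones(H, dtype=bool)
    for _ in range(n_rounds):
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        # Drop hypotheses whose predicted means are far from the empirical
        # means on arms that have been sampled (a crude confidence test).
        radius = conf_width / np.sqrt(np.maximum(counts, 1.0))
        for h in np.flatnonzero(alive):
            if np.any((counts > 0) & (np.abs(hypotheses[h] - means) > radius)):
                alive[h] = False
        if not alive.any():              # safety net for the toy example
            alive[:] = True
        live = hypotheses[alive]
        best_vals = live.max(axis=1)
        optimist = live[best_vals.argmax()]   # most favorable survivor
        pessimist = live[best_vals.argmin()]  # least favorable survivor
        if np.allclose(optimist, pessimist):
            arm = int(optimist.argmax())      # survivors agree: exploit
        else:
            # Pull the arm where the two survivors disagree the most;
            # a stand-in for the paper's "informative arm" selection.
            arm = int(np.abs(optimist - pessimist).argmax())
        reward = pull(arm)
        counts[arm] += 1.0
        sums[arm] += reward
    return counts
```

A quick test could use, for instance, pull = lambda a: rng.normal(true_means[a], 1.0) with true_means set to one row of hypotheses; the elimination step then narrows the surviving set to candidates consistent with that row.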
Related papers
- Optimal Batched Linear Bandits [9.795531604467406]
We introduce the E$4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Exploit framework.
With a proper choice of exploration rate, we prove E$4$ achieves finite-time minimax optimal regret with only $O(\log\log T)$ batches.
We show that with another choice of exploration rate, E$4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches.
arXiv Detail & Related papers (2024-06-06T14:57:52Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Instance-Optimality in Interactive Decision Making: Toward a
Non-Asymptotic Theory [30.061707627742766]
We aim for instance-optimality, a strong notion of adaptivity which asserts that, on any particular problem instance, the algorithm under consideration outperforms all consistent algorithms.
In this paper, we take the first step toward developing a non-asymptotic theory of instance-optimal decision making with general function approximation.
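For context, the asymptotic instance-dependent benchmark that both this paper and the CROP abstract above refer to is commonly stated as a Graves-Lai-type lower bound: for every consistent algorithm and every instance $\theta$,
$$\liminf_{n\to\infty}\frac{\mathbb{E}[\mathrm{Regret}(n)]}{\log n}\;\ge\;c^*(\theta),$$
where $c^*(\theta)$ is the value of an instance-dependent optimization over arm-pull allocations, and an algorithm is asymptotically optimal when it matches $c^*(\theta)$ on every instance. This is standard background phrasing and may differ in detail from the definitions used in either paper.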
arXiv Detail & Related papers (2023-04-24T21:51:58Z) - Improved High-Probability Regret for Adversarial Bandits with
Time-Varying Feedback Graphs [62.52390282012508]
We study high-probability regret bounds for adversarial $K$-armed bandits with time-varying feedback graphs over $T$ rounds.
We develop an algorithm that achieves the optimal regret $\widetilde{\mathcal{O}}\big((\sum_{t=1}^{T}\alpha_t)^{1/2}+\max_{t\in[T]}\alpha_t\big)$ with high probability.
We also develop the first algorithm that achieves the optimal high-probability regret bound for weakly observable graphs.
arXiv Detail & Related papers (2022-10-04T04:36:15Z) - Stochastic Online Learning with Feedback Graphs: Finite-Time and
Asymptotic Optimality [39.2230418563572]
We show that, surprisingly, the notion of optimal finite-time regret is not a uniquely defined property in this context.
We give an algorithm that admits quasi-optimal regret both in finite time and asymptotically.
arXiv Detail & Related papers (2022-06-20T22:11:08Z) - Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate stationary point within a polynomial number of gradient ascent steps.
arXiv Detail & Related papers (2021-06-02T22:03:36Z) - An Asymptotically Optimal Primal-Dual Incremental Algorithm for
Contextual Linear Bandits [129.1029690825929]
We introduce a novel algorithm improving over the state-of-the-art along multiple dimensions.
We establish minimax optimality for any learning horizon in the special case of non-contextual linear bandits.
arXiv Detail & Related papers (2020-10-23T09:12:47Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - Lenient Regret for Multi-Armed Bandits [72.56064196252498]
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
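One natural way to formalize this criterion (an illustrative formalization, not necessarily the paper's exact definition) is to count only gaps larger than $\epsilon$:
$$\mathrm{LenientRegret}_\epsilon(n)\;=\;\mathbb{E}\Big[\sum_{t=1}^{n}\Delta_{a_t}\,\mathbf{1}\{\Delta_{a_t}>\epsilon\}\Big],$$
where $\Delta_a=\mu^\star-\mu_a$ is the suboptimality gap of arm $a$ and $a_t$ is the arm pulled at round $t$; pulls of arms within $\epsilon$ of the best arm contribute nothing.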
arXiv Detail & Related papers (2020-08-10T08:30:52Z) - Predictive Power of Nearest Neighbors Algorithm under Random
Perturbation [21.79888306754263]
We consider a data corruption scenario in the classical $k$ Nearest Neighbors ($k$-NN) algorithm.
Under such a scenario, the impact of corruption level on the regret is carefully characterized.
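The corruption scenario is easy to emulate numerically. The sketch below is an assumed toy setup, not the paper's experimental protocol: a plain $k$-NN classifier is evaluated on test queries whose features are perturbed by Gaussian noise of increasing magnitude.

```python
import numpy as np

# Illustrative experiment (assumptions, not the paper's setup): run plain k-NN
# while the *query* features are corrupted by random noise, and observe how
# accuracy degrades as the corruption level grows.

def knn_predict(X_train, y_train, x_query, k=5):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # simple linear concept
X_test = rng.normal(size=(200, 2))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)

for sigma in [0.0, 0.5, 1.0]:                     # corruption level
    X_corrupt = X_test + rng.normal(scale=sigma, size=X_test.shape)
    acc = np.mean([knn_predict(X, y, x) == t for x, t in zip(X_corrupt, y_test)])
    print(f"corruption sigma={sigma:.1f}  accuracy={acc:.3f}")
```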
arXiv Detail & Related papers (2020-02-13T01:35:35Z)