Lenient Regret for Multi-Armed Bandits
- URL: http://arxiv.org/abs/2008.03959v4
- Date: Sun, 12 Sep 2021 12:22:50 GMT
- Title: Lenient Regret for Multi-Armed Bandits
- Authors: Nadav Merlis, Shie Mannor
- Abstract summary: We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took.
While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and the agent's action, this criterion might lead to undesirable results.
We suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $\epsilon$.
- Score: 72.56064196252498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially
chooses actions and observes rewards for the actions it took. While the
majority of algorithms try to minimize the regret, i.e., the cumulative
difference between the reward of the best action and the agent's action, this
criterion might lead to undesirable results. For example, in large problems, or
when the interaction with the environment is brief, finding an optimal arm is
infeasible, and regret-minimizing algorithms tend to over-explore. To overcome
this issue, algorithms for such settings should instead focus on playing
near-optimal arms. To this end, we suggest a new, more lenient, regret
criterion that ignores suboptimality gaps smaller than some $\epsilon$. We then
present a variant of the Thompson Sampling (TS) algorithm, called
$\epsilon$-TS, and prove its asymptotic optimality in terms of the lenient
regret. Importantly, we show that when the mean of the optimal arm is high
enough, the lenient regret of $\epsilon$-TS is bounded by a constant. Finally,
we show that $\epsilon$-TS can be applied to improve the performance when the
agent knows a lower bound of the suboptimality gaps.
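As a rough, hypothetical illustration of the criterion (not code from the paper), the Python sketch below computes lenient regret in its simplest form, counting only suboptimality gaps larger than $\epsilon$, for a run of plain Bernoulli Thompson Sampling. The function names `lenient_regret` and `bernoulli_ts`, the hard-threshold gap function, and the toy arm means are assumptions made for illustration; the paper's $\epsilon$-TS modifies the sampling rule itself and is not reproduced here.

```python
import numpy as np

def lenient_regret(means, pulls, epsilon):
    """Cumulative lenient regret of a sequence of pulled arms.

    Only suboptimality gaps strictly larger than `epsilon` are counted;
    smaller gaps are treated as "good enough" and contribute nothing.
    """
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means
    penalized = np.where(gaps > epsilon, gaps, 0.0)  # ignore small gaps
    return float(penalized[np.asarray(pulls)].sum())

def bernoulli_ts(means, horizon, rng):
    """Plain Bernoulli Thompson Sampling (NOT the paper's epsilon-TS)."""
    k = len(means)
    alpha, beta = np.ones(k), np.ones(k)  # Beta(1, 1) prior for each arm
    pulls = []
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)              # one posterior sample per arm
        arm = int(np.argmax(theta))                # play the sampled-best arm
        reward = float(rng.random() < means[arm])  # Bernoulli reward
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        pulls.append(arm)
    return pulls

rng = np.random.default_rng(0)
true_means = [0.90, 0.88, 0.50]  # second arm is near-optimal (gap 0.02)
pulls = bernoulli_ts(true_means, horizon=5000, rng=rng)
print("lenient regret:", lenient_regret(true_means, pulls, epsilon=0.05))
print("standard regret:", lenient_regret(true_means, pulls, epsilon=0.0))
```

With the toy means above, the 0.02 gap of the second arm falls below $\epsilon = 0.05$, so pulls of that near-optimal arm add nothing to the lenient regret, whereas the standard regret keeps growing by 0.02 per such pull.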
Related papers
- Learning for Bandits under Action Erasures [20.235500642661812]
We consider a novel multi-arm bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels.
In our model, while the distributed agents know if an action is erased, the central learner does not.
We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures.
arXiv Detail & Related papers (2024-06-26T05:03:00Z)
- Optimal Multi-Fidelity Best-Arm Identification [65.23078799972188]
In bandit best-arm identification, an algorithm is tasked with finding the arm with highest mean reward with a specified accuracy as fast as possible.
We study multi-fidelity best-arm identification, in which the algorithm can choose to sample an arm at a lower fidelity (less accurate mean estimate) for a lower cost.
Several methods have been proposed for tackling this problem, but their optimality remains elusive, notably due to loose lower bounds on the total cost needed to identify the best arm.
arXiv Detail & Related papers (2024-06-05T08:02:40Z)
- Multi-Armed Bandits with Abstention [62.749500564313834]
We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic element: abstention.
In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to abstain from accepting the instantaneous reward before observing it.
arXiv Detail & Related papers (2024-02-23T06:27:12Z)
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
- Adaptive Regret for Bandits Made Possible: Two Queries Suffice [26.769372199571002]
We give query and regret optimal bandit algorithms under the strict notion of strongly adaptive regret.
Surprisingly, with just two queries per round, we give a Strongly Adaptive Bandit Learner (StABL) that achieves $\tilde{O}(\sqrt{n|I|})$ adaptive regret for multi-armed bandits.
arXiv Detail & Related papers (2024-01-17T15:32:04Z)
- Combinatorial Stochastic-Greedy Bandit [79.1700188160944]
We propose a novel stochastic-greedy bandit (SGB) algorithm for multi-armed bandit problems when no extra information other than the joint reward of the selected set of $n$ arms at each time $t \in [T]$ is observed.
SGB adopts an optimized-explore-then-commit approach and is specifically designed for scenarios with a large set of base arms.
arXiv Detail & Related papers (2023-12-13T11:08:25Z)
- Optimal and Efficient Dynamic Regret Algorithms for Non-Stationary Dueling Bandits [27.279654173896372]
We study the problem of dynamic regret minimization in $K$-armed Dueling Bandits under non-stationary or time-varying preferences.
This is an online learning setup where the agent chooses a pair of items at each round and observes only relative binary win-loss feedback for this pair.
arXiv Detail & Related papers (2021-11-06T16:46:55Z)
- Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress (having no limit cycles) and to find an appropriate solution within a bounded number of gradient ascent steps.
arXiv Detail & Related papers (2021-06-02T22:03:36Z)
- Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality [25.627939043735683]
We study structured bandits for minimizing regret.
In this paper we focus on the finite hypothesis case and ask whether one can achieve asymptotic optimality while enjoying bounded regret.
arXiv Detail & Related papers (2020-06-15T20:46:52Z)