Nonstochastic Bandits with Infinitely Many Experts
- URL: http://arxiv.org/abs/2102.05164v1
- Date: Tue, 9 Feb 2021 22:42:36 GMT
- Title: Nonstochastic Bandits with Infinitely Many Experts
- Authors: X. Flora Meng, Tuhin Sarkar, Munther A. Dahleh
- Abstract summary: We study the problem of nonstochastic bandits with infinitely many experts.
We propose a variant of Exp4.P that, for finitely many experts, enables inference of correct expert rankings.
We then incorporate the variant into a meta-algorithm that works on infinitely many experts.
- Score: 1.7188280334580197
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of nonstochastic bandits with infinitely many experts: A
learner aims to maximize the total reward by taking actions sequentially based
on bandit feedback while benchmarking against a countably infinite set of
experts. We propose a variant of Exp4.P that, for finitely many experts,
enables inference of correct expert rankings while preserving the order of the
regret upper bound. We then incorporate the variant into a meta-algorithm that
works on infinitely many experts. We prove a high-probability upper bound of
$\tilde{\mathcal{O}} \big( i^*K + \sqrt{KT} \big)$ on the regret, up to polylog
factors, where $i^*$ is the unknown position of the best expert, $K$ is the
number of actions, and $T$ is the time horizon. We also provide an example of
structured experts and discuss how to expedite learning in such a case. Our
meta-learning algorithm achieves the tightest regret upper bound for the
setting considered when $i^* = \tilde{\mathcal{O}} \big( \sqrt{T/K} \big)$. If
a prior distribution is assumed to exist for $i^*$, the probability of
satisfying a tight regret bound increases with $T$, the rate of which can be
fast.
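For intuition, here is a minimal, hypothetical sketch of the kind of exponential-weights aggregation that Exp4-style algorithms perform over a finite prefix of experts. It is not the paper's Exp4.P variant (which adds confidence terms to obtain high-probability guarantees and to support inferring expert rankings); the function name `exp4_prefix`, the `gamma` parameter, and the `experts`/`rewards` callables are all illustrative assumptions.

```python
import numpy as np

def exp4_prefix(experts, K, T, rewards, gamma=0.1, rng=None):
    """Hypothetical Exp4-style sketch over a finite prefix of experts.

    experts : list of callables; experts[i](t) returns a length-K probability
              vector (expert i's advice over the K actions at round t).
    rewards : callable; rewards(t, a) returns the reward in [0, 1] of the
              played action a at round t (bandit feedback only).
    Omits Exp4.P's confidence terms; shown only to illustrate the
    importance-weighted exponential-weights update.
    """
    rng = rng or np.random.default_rng(0)
    N = len(experts)
    log_w = np.zeros(N)                        # log-weights over the N experts
    total_reward = 0.0
    for t in range(T):
        advice = np.array([e(t) for e in experts])    # N x K advice matrix
        q = np.exp(log_w - log_w.max())
        q /= q.sum()                           # distribution over experts
        p = (1.0 - gamma) * (q @ advice) + gamma / K  # mixed action distribution
        p = p / p.sum()                        # guard against floating-point drift
        a = rng.choice(K, p=p)                 # play an action
        r = rewards(t, a)                      # observe only its reward (bandit feedback)
        total_reward += r
        r_hat = np.zeros(K)
        r_hat[a] = r / p[a]                    # importance-weighted reward estimate
        log_w += (gamma / K) * (advice @ r_hat)  # credit experts by their advice
    return total_reward
```

One natural way a meta-algorithm could handle countably many experts is to invoke such a routine on successively larger prefixes of the expert sequence; whether or not that matches the paper's exact construction, the $\tilde{\mathcal{O}}(i^*K + \sqrt{KT})$ bound reflects the cost of not knowing the best expert's position $i^*$ in advance.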
Related papers
- Improved Regret Bounds for Bandits with Expert Advice [16.699381591572163]
We prove a lower bound of order $\sqrt{KT \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N > K$ the number of experts, and $T$ the time horizon.
This matches a previously known upper bound of the same order and improves upon the best available lower bound of $\sqrt{KT (\ln N) / (\ln K)}$.
arXiv Detail & Related papers (2024-06-24T17:14:31Z) - Near-optimal Per-Action Regret Bounds for Sleeping Bandits [8.261117235807607]
We derive near-optimal per-action regret bounds for sleeping bandits.
In a setting with $K$ total arms and at most $A$ available arms in each round over $T$ rounds, the best known upper bound is $O(K\sqrt{TA \ln K})$.
We extend our results to the setting of bandits with advice from sleeping experts, generalizing EXP4 along the way.
arXiv Detail & Related papers (2024-03-02T21:22:46Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{\mathcal{O}}\big((d + \log(|\mathcal{S}|^2 |\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Streaming Algorithms for Learning with Experts: Deterministic Versus
Robust [62.98860182111096]
In the online learning with experts problem, an algorithm must make a prediction about an outcome on each of $T$ days (or times).
The goal is to make a prediction with the minimum cost, specifically compared to the best expert in the set.
We show a space lower bound of $\widetilde{\Omega}\left(\frac{nM}{RT}\right)$ for any deterministic algorithm that achieves regret $R$ when the best expert makes $M$ mistakes.
arXiv Detail & Related papers (2023-03-03T04:39:53Z) - Constant regret for sequence prediction with limited advice [0.0]
We provide a strategy that combines only $p = 2$ experts per round for prediction while observing the losses of $m \ge 2$ experts.
If the learner is constrained to observe only a single expert's feedback per round, the worst-case regret is the "slow rate" $\Omega(\sqrt{KT})$.
arXiv Detail & Related papers (2022-10-05T13:32:49Z) - Bandits with Stochastic Experts: Constant Regret, Empirical Experts and Episodes [36.104981594178525]
We study a variant of the contextual bandit problem where an agent can intervene through a set of expert policies.
We propose the Divergence-based Upper Confidence Bound (D-UCB) algorithm that uses importance sampling to share information across experts.
We also provide the Empirical D-UCB (ED-UCB) algorithm that can function with only approximate knowledge of expert distributions.
arXiv Detail & Related papers (2021-07-07T14:58:14Z) - Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits
with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
arXiv Detail & Related papers (2021-01-20T04:29:18Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Improved Sleeping Bandits with Stochastic Action Sets and Adversarial
Rewards [59.559028338399855]
We consider the problem of sleeping bandits with stochastic action sets and adversarial rewards.
In this paper, we provide a new computationally efficient algorithm inspired by EXP3 that achieves a regret of order $O(\sqrt{T})$.
arXiv Detail & Related papers (2020-04-14T00:41:26Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\,T^{1/3}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)