Continuous Time Bandits With Sampling Costs
- URL: http://arxiv.org/abs/2107.05289v2
- Date: Wed, 19 Apr 2023 12:36:13 GMT
- Title: Continuous Time Bandits With Sampling Costs
- Authors: Rahul Vaze and Manjesh K. Hanawal
- Abstract summary: We consider a continuous-time multi-arm bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtain a random reward from each sample.
There is a tradeoff between obtaining a large reward and incurring sampling cost as a function of the sampling frequency.
The goal is to design a learning algorithm that minimizes regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm.
- Score: 17.412117389855222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a continuous-time multi-arm bandit problem (CTMAB), where the
learner can sample arms any number of times in a given interval and obtain a
random reward from each sample; however, increasing the frequency of sampling
incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a
large reward and incurring sampling cost as a function of the sampling
frequency. The goal is to design a learning algorithm that minimizes regret,
defined as the difference between the payoff of the oracle policy and that
of the learning algorithm. CTMAB is fundamentally different from the usual
multi-arm bandit problem (MAB), e.g., even the single-arm case is non-trivial
in CTMAB, since the optimal sampling frequency depends on the mean of the arm,
which needs to be estimated. We first establish lower bounds on the regret
achievable with any algorithm and then propose algorithms that achieve the
lower bound up to logarithmic factors. For the single-arm case, we show that
the lower bound on the regret is $\Omega((\log T)^2/\mu)$, where $\mu$ is the
mean of the arm, and $T$ is the time horizon. For the multiple arms case, we
show that the lower bound on the regret is $\Omega((\log T)^2 \mu/\Delta^2)$,
where $\mu$ now represents the mean of the best arm, and $\Delta$ is the
difference of the mean of the best and the second-best arm. We then propose an
algorithm that achieves the bound up to constant terms.
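The abstract does not spell out the cost model or the algorithm, so the following is a minimal simulation sketch under illustrative assumptions: per-sample rewards with unknown mean $\mu$, a quadratic penalty in the sampling frequency, and a naive plug-in learner that re-estimates the optimal frequency from its running mean. None of these modeling choices (the quadratic cost, the plug-in rule, the function names) are taken from the paper; the sketch only illustrates why the optimal sampling frequency depends on the unknown mean, which is what makes even the single-arm case non-trivial.

```python
import numpy as np

# Illustrative sketch (not the paper's model): reward accrues per sample with
# unknown mean mu, while sampling at frequency f costs c * f**2 per unit time.
# Under this assumed cost, the payoff rate f*mu - c*f**2 is maximized at
# f* = mu / (2c), which depends on the unknown mean and must be estimated.
rng = np.random.default_rng(0)

def payoff_rate(f, mu, c=1.0):
    """Expected payoff per unit time when sampling at frequency f."""
    return f * mu - c * f ** 2

def oracle_frequency(mu, c=1.0):
    """Maximizer of the assumed payoff rate: f* = mu / (2c)."""
    return mu / (2.0 * c)

def plug_in_learner(mu_true, horizon=1000, dt=1.0, c=1.0):
    """Naive learner: estimate mu from all samples so far, then sample at the plug-in optimum."""
    samples = []
    net_payoff = 0.0
    f = 1.0  # initial exploratory frequency
    t = 0.0
    while t < horizon:
        n = max(1, int(round(f * dt)))            # samples taken in this slot
        rewards = rng.normal(mu_true, 1.0, size=n)
        samples.extend(rewards.tolist())
        net_payoff += rewards.sum() - c * f ** 2 * dt
        f = max(0.1, oracle_frequency(np.mean(samples), c))  # plug-in update
        t += dt
    return net_payoff

mu_true = 2.0
oracle_payoff = payoff_rate(oracle_frequency(mu_true), mu_true) * 1000
print("estimated regret:", oracle_payoff - plug_in_learner(mu_true))
```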
Related papers
- A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z)
- Optimal Top-Two Method for Best Arm Identification and Fluid Analysis [15.353009236788262]
We propose an optimal top-2 type algorithm for the best arm identification problem.
We show that the proposed algorithm is optimal as $\delta \rightarrow 0$.
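The summary names a top-two style algorithm but gives no details; below is a generic top-two sampling rule (pull the empirical leader with a fixed probability $\beta$, otherwise a challenger), which is a sketch of the general technique and not the paper's specific method. The challenger index used here (empirical mean plus a $\sqrt{\log t / N}$ bonus) is an illustrative choice.

```python
import numpy as np

def top_two_sample(means, counts, t, beta=0.5, rng=None):
    """Generic top-two rule: pull the empirical leader w.p. beta, otherwise a
    challenger picked by an (illustrative) optimism bonus among the other arms."""
    rng = rng or np.random.default_rng()
    leader = int(np.argmax(means))
    if rng.random() < beta:
        return leader
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
    index = means + bonus
    index[leader] = -np.inf                    # challenger must differ from the leader
    return int(np.argmax(index))

# Tiny usage example on three Bernoulli arms.
rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.5, 0.45])
means, counts = np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    a = top_two_sample(means, counts, t, rng=rng)
    r = float(rng.random() < true_means[a])
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]     # running-mean update
print("recommended arm:", int(np.argmax(means)))
```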
arXiv Detail & Related papers (2024-03-14T06:14:07Z)
- Combinatorial Bandits for Maximum Value Reward Function under Max Value-Index Feedback [9.771002043127728]
We consider a multi-armed bandit problem for maximum value reward function under maximum value and index feedback.
We propose an algorithm and provide a regret bound for problem instances with arm outcomes according to arbitrary distributions with finite supports.
Our algorithm achieves a $O((k/\Delta)\log(T))$ distribution-dependent and a $\tilde{O}(\sqrt{T})$ distribution-independent regret.
arXiv Detail & Related papers (2023-05-25T14:02:12Z)
- Communication-Constrained Bandits under Additive Gaussian Noise [111.06688156723018]
We study a distributed multi-armed bandit where a client supplies the learner with communication-constrained feedback.
We propose a multi-phase bandit algorithm, $\mathtt{UE}\text{-UCB++}$, which matches this lower bound to a minor additive factor.
arXiv Detail & Related papers (2023-04-25T09:31:20Z)
- Revisiting Simple Regret Minimization in Multi-Armed Bandits [33.88679679593314]
Simple regret is less popular than the probability of missing the best arm or an $\epsilon$-good arm.
In this paper, we achieve improved simple regret upper bounds for both the data-rich ($T \ge n$) and data-poor ($T \le n$) regimes.
For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once.
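The "SH" in BSH refers to Sequential Halving; the sketch below is the standard Sequential Halving routine for best-arm identification under a fixed budget, not the paper's Bracketing SH (BSH) variant, and the budget-splitting details are the textbook ones.

```python
import math
import numpy as np

def sequential_halving(pull, n_arms, budget):
    """Standard Sequential Halving: split the budget over ceil(log2(n)) rounds,
    pull every surviving arm equally, and keep the better half each round."""
    survivors = list(range(n_arms))
    means = np.zeros(n_arms)
    counts = np.zeros(n_arms)
    rounds = math.ceil(math.log2(n_arms))
    for _ in range(rounds):
        per_arm = max(1, budget // (len(survivors) * rounds))
        for a in survivors:
            for _ in range(per_arm):
                counts[a] += 1
                means[a] += (pull(a) - means[a]) / counts[a]
        survivors.sort(key=lambda a: means[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

# Usage with Bernoulli arms (illustrative means).
rng = np.random.default_rng(2)
true_means = [0.2, 0.4, 0.35, 0.6, 0.55]
best = sequential_halving(lambda a: float(rng.random() < true_means[a]),
                          n_arms=len(true_means), budget=2000)
print("identified arm:", best)
```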
arXiv Detail & Related papers (2022-10-30T18:31:03Z)
- PAC Best Arm Identification Under a Deadline [101.10352416022559]
We study $(\epsilon, \delta)$-PAC best arm identification, where a decision-maker must identify an optimal arm with probability at least $1 - \delta$, while minimizing the number of arm pulls (samples).
In this work, the decision-maker is given a deadline of $T$ rounds, where, on each round, it can adaptively choose which arms to pull and how many times to pull them.
We propose Elastic Batch Racing (EBR), a novel algorithm for this setting and bound its sample complexity, showing that EBR is optimal with respect to both
arXiv Detail & Related papers (2021-06-06T19:48:32Z)
- Bandits with many optimal arms [68.17472536610859]
We write $p^*$ for the proportion of optimal arms and $\Delta$ for the minimal mean-gap between optimal and sub-optimal arms.
We characterize the optimal learning rates both in the cumulative regret setting, and in the best-arm identification setting.
arXiv Detail & Related papers (2021-03-23T11:02:31Z)
- Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy.
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T \log (|\mathcal{F}|T)})$.
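The summary names the Inverse Gap Weighting (IGW) strategy; the function below is the generic IGW action distribution built from an arbitrary reward predictor, a standard construction rather than the paper's top-$k$ extreme-arm version, and the learning-rate parameter $\gamma$ used here is illustrative.

```python
import numpy as np

def inverse_gap_weighting(predicted_rewards, gamma):
    """Generic IGW distribution: actions with a larger predicted gap to the
    predicted best get proportionally smaller probability; the leftover
    mass goes to the predicted best action."""
    y = np.asarray(predicted_rewards, dtype=float)
    best = int(np.argmax(y))
    n = len(y)
    probs = 1.0 / (n + gamma * (y[best] - y))   # gap-based weights
    probs[best] = 0.0
    probs[best] = 1.0 - probs.sum()             # remaining mass to the predicted best
    return probs

# Usage: sample an arm from the IGW distribution over model predictions.
rng = np.random.default_rng(3)
p = inverse_gap_weighting([0.2, 0.8, 0.5, 0.6], gamma=10.0)
arm = rng.choice(len(p), p=p)
print(p, "-> sampled arm", arm)
```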
arXiv Detail & Related papers (2021-02-15T19:10:52Z)
- Near-Optimal Regret Bounds for Contextual Combinatorial Semi-Bandits with Linear Payoff Functions [53.77572276969548]
We show that the C$^2$UCB algorithm has the optimal regret bound $\tilde{O}(d\sqrt{kT} + dk)$ for the partition matroid constraints.
For general constraints, we propose an algorithm that modifies the reward estimates of arms in the C$^2$UCB algorithm.
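For context, the sketch below is a generic linear-UCB index for combinatorial semi-bandits with a top-$k$ (uniform matroid) selection, used only as a stand-in for the matroid-constrained setting; it is not the paper's C$^2$UCB algorithm or its modified reward estimates, and all parameter choices are illustrative.

```python
import numpy as np

def lin_ucb_super_arm(contexts, V, b, alpha, k):
    """Score every arm by x^T theta_hat + alpha * sqrt(x^T V^{-1} x) and pick the
    top-k scores (a uniform-matroid constraint as a simple illustrative case)."""
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    widths = np.sqrt(np.einsum("ij,jk,ik->i", contexts, V_inv, contexts))
    scores = contexts @ theta_hat + alpha * widths
    return np.argsort(scores)[-k:]            # indices of the k highest UCB scores

def update(V, b, contexts, chosen, rewards):
    """Ridge-regression style statistics update from semi-bandit feedback."""
    for a, r in zip(chosen, rewards):
        x = contexts[a]
        V += np.outer(x, x)
        b += r * x
    return V, b

# Usage on a toy problem with d = 3 features, 6 arms, and k = 2 arms per round.
rng = np.random.default_rng(4)
d, n_arms, k = 3, 6, 2
theta_true = np.array([0.5, -0.2, 0.8])
V, b = np.eye(d), np.zeros(d)                 # identity V acts as ridge regularization
for t in range(200):
    X = rng.normal(size=(n_arms, d))
    chosen = lin_ucb_super_arm(X, V, b, alpha=1.0, k=k)
    rewards = X[chosen] @ theta_true + 0.1 * rng.normal(size=k)
    V, b = update(V, b, X, chosen, rewards)
print("estimated theta:", np.linalg.solve(V, b))
```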
arXiv Detail & Related papers (2021-01-20T04:29:18Z)
- Quantum exploration algorithms for multi-armed bandits [15.666205208594567]
We show that we can find the best arm with fixed confidence using $\tilde{O}\bigl(\sqrt{\sum_{i=2}^{n} \Delta_i^{-2}}\bigr)$ quantum queries.
This algorithm, based on variable-time amplitude amplification and estimation, gives a quadratic speedup compared to the best possible classical result.
arXiv Detail & Related papers (2020-07-14T14:15:20Z)
- Blocking Bandits [33.14975454724348]
We consider a novel multi-armed bandit setting, where playing an arm makes it unavailable for a fixed number of time slots thereafter.
We show that with prior knowledge of the rewards and delays of all the arms, the problem of optimizing cumulative reward does not admit any pseudo-polynomial time algorithm.
We design a UCB based algorithm which is shown to have $c \log T + o(\log T)$ cumulative regret against the greedy algorithm.
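The summary describes arms that become unavailable for a fixed number of slots after being played, with a UCB algorithm compared against a greedy benchmark; the sketch below implements only that greedy baseline with known means and assumed per-arm delays, to illustrate the blocking constraint rather than the paper's UCB algorithm.

```python
import numpy as np

def greedy_blocking(means, delays, horizon):
    """Greedy baseline for blocking bandits: each slot, play the available arm with
    the highest (here: known) mean; the played arm is then blocked for delays[i]
    subsequent slots."""
    means = np.asarray(means, dtype=float)
    next_free = np.zeros(len(means), dtype=int)   # first slot each arm is available again
    total = 0.0
    for t in range(horizon):
        available = np.flatnonzero(next_free <= t)
        if available.size == 0:
            continue                              # every arm is blocked in this slot
        a = available[np.argmax(means[available])]
        total += means[a]
        next_free[a] = t + 1 + delays[a]          # unavailable for delays[a] slots after t
    return total

# Usage: three arms with illustrative means and blocking delays.
print(greedy_blocking(means=[0.9, 0.6, 0.3], delays=[3, 2, 1], horizon=100))
```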
arXiv Detail & Related papers (2019-07-27T20:42:01Z)