Asymptotically Optimal Bandits under Weighted Information
- URL: http://arxiv.org/abs/2105.14114v1
- Date: Fri, 28 May 2021 21:28:44 GMT
- Title: Asymptotically Optimal Bandits under Weighted Information
- Authors: Matias I. Müller and Cristian R. Rojas
- Abstract summary: We study the problem of regret minimization in a multi-armed bandit setup where the agent is allowed to play multiple arms at each round.
We propose a Thompson-Sampling-based strategy that designs the power profile as its posterior belief of each arm being the best arm.
- Score: 10.939683083130616
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We study the problem of regret minimization in a multi-armed bandit setup
where the agent is allowed to play multiple arms at each round by spreading the
resources usually allocated to only one arm. At each iteration the agent
selects a normalized power profile and receives a Gaussian vector as outcome,
where the unknown variance of each sample is inversely proportional to the
power allocated to that arm. The reward corresponds to a linear combination of
the power profile and the outcomes, resembling a linear bandit. By spreading
the power, the agent can choose to collect information much faster than in a
traditional multi-armed bandit at the price of reducing the accuracy of the
samples. This setup is fundamentally different from that of a linear bandit --
the regret is known to scale as $\Theta(\sqrt{T})$ for linear bandits, while in
this setup the agent receives a much more detailed feedback, for which we
derive a tight $\log(T)$ problem-dependent lower bound. We propose a
Thompson-Sampling-based strategy, called Weighted Thompson Sampling (WTS),
that designs the power profile as its posterior belief of each arm being the
best arm, and show that its upper bound matches the derived logarithmic lower
bound. Finally, we apply this strategy to a problem of control and system
identification, where the goal is to estimate the maximum gain (also called
$\mathcal{H}_\infty$-norm) of a linear dynamical system based on batches of
input-output samples.
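
To make the round structure concrete, the sketch below simulates a simplified version of the setup and of a WTS-style rule: the power profile is a Monte-Carlo estimate of the posterior probability that each arm is best, each arm returns a Gaussian sample whose variance is inversely proportional to the power it received, and the reward is the power-weighted combination of the outcomes. This is a minimal illustration under assumed details (a known common noise scale sigma^2, near-flat Gaussian priors, conjugate updates); function and variable names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)


def weighted_thompson_sampling(true_means, sigma2=1.0, horizon=1000, n_mc=500):
    """Illustrative WTS-style strategy (a sketch, not the paper's exact algorithm).

    At each round the power profile is the estimated posterior probability of
    each arm being the best; every arm then returns a Gaussian sample whose
    variance is inversely proportional to the power it received, and the
    reward is the power-weighted sum of the outcomes.
    """
    K = len(true_means)
    post_prec = np.full(K, 1e-6)   # near-flat Gaussian prior: per-arm precision ...
    post_mean = np.zeros(K)        # ... and per-arm posterior mean
    total_reward = 0.0
    alpha = np.full(K, 1.0 / K)    # initial uniform power profile

    for _ in range(horizon):
        # Monte-Carlo estimate of P(arm i is the best | data).
        draws = rng.normal(post_mean, 1.0 / np.sqrt(post_prec), size=(n_mc, K))
        best_counts = np.bincount(draws.argmax(axis=1), minlength=K)
        alpha = (best_counts + 1e-9) / (n_mc + K * 1e-9)  # keep every entry > 0

        # Each arm returns a sample with variance sigma^2 / alpha_i.
        outcomes = rng.normal(true_means, np.sqrt(sigma2 / alpha))

        # Reward: linear combination of the power profile and the outcomes.
        total_reward += float(alpha @ outcomes)

        # Conjugate Gaussian update; observation i carries precision alpha_i / sigma^2.
        obs_prec = alpha / sigma2
        post_mean = (post_prec * post_mean + obs_prec * outcomes) / (post_prec + obs_prec)
        post_prec += obs_prec

    return total_reward, alpha


# Three arms with means 0.2, 0.5 and 0.45: the final profile should place
# most of its mass on the second arm.
reward, final_profile = weighted_thompson_sampling(np.array([0.2, 0.5, 0.45]))
print(final_profile)
```

Over many rounds the profile concentrates on the (empirically) best arm, so the suboptimal arms are sampled with ever larger variance but also with ever smaller weight in the reward, which is the intuition behind the logarithmic regret discussed in the abstract.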
Related papers
- Influential Bandits: Pulling an Arm May Change the Environment [44.71145269686588]
Real-world applications often involve non-stationary environments and interdependencies between arms.
We propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix.
We introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics.
arXiv Detail & Related papers (2025-04-11T02:05:51Z) - Semi-Parametric Batched Global Multi-Armed Bandits with Covariates [0.48342038441006807]
The multi-armed bandit (MAB) framework is a widely used approach for sequential decision-making.
We propose a novel semi-parametric framework for batched bandits with covariates and a shared parameter across arms.
Our algorithm, Batched single-Index Dynamic binning and Successive arm elimination (BIDS), employs a batched successive arm elimination strategy.
arXiv Detail & Related papers (2025-03-01T17:23:55Z) - Continuous K-Max Bandits [54.21533414838677]
We study the $K$-Max multi-armed bandits problem with continuous outcome distributions and weak value-index feedback.
This setting captures critical applications in recommendation systems, distributed computing, server scheduling, etc.
Our key contribution is the computationally efficient algorithm DCK-UCB, which combines adaptive discretization with bias-corrected confidence bounds.
arXiv Detail & Related papers (2025-02-19T06:37:37Z) - An Algorithm for Fixed Budget Best Arm Identification with Combinatorial Exploration [3.9901365062418312]
We consider the best arm identification problem in the $K$-armed bandit framework.
The agent is allowed to play a subset of arms at each time slot instead of a single arm.
We propose an algorithm that constructs $\log K$ groups and performs a likelihood ratio test to detect the presence of the best arm.
arXiv Detail & Related papers (2025-02-03T15:10:08Z) - Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting [67.1631453378926]
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
arXiv Detail & Related papers (2024-09-09T18:23:07Z) - A General Framework for Clustering and Distribution Matching with Bandit Feedback [81.50716021326194]
We develop a general framework for clustering and distribution matching problems with bandit feedback.
We derive a non-asymptotic lower bound on the average number of arm pulls for any online algorithm with an error probability not exceeding $\delta$.
arXiv Detail & Related papers (2024-09-08T12:19:12Z) - Thompson Exploration with Best Challenger Rule in Best Arm
Identification [66.33448474838342]
We study the fixed-confidence best arm identification problem in the bandit framework.
We propose a novel policy that combines Thompson sampling with a computationally efficient approach known as the best challenger rule.
arXiv Detail & Related papers (2023-10-01T01:37:02Z) - Beyond the Best: Estimating Distribution Functionals in Infinite-Armed
Bandits [40.71199236098642]
In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution.
We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings.
arXiv Detail & Related papers (2022-11-01T18:20:10Z) - Risk-averse Contextual Multi-armed Bandit Problem with Linear Payoffs [7.125769932993104]
We consider the contextual multi-armed bandit problem for linear payoffs under a risk-averse criterion.
At each round, contexts are revealed for each arm, and the decision maker chooses one arm to pull and receives the corresponding reward.
We apply the Thompson Sampling algorithm for the disjoint model, and provide a comprehensive regret analysis for a variant of the proposed algorithm.
arXiv Detail & Related papers (2022-06-24T18:48:35Z) - Finite-Time Regret of Thompson Sampling Algorithms for Exponential
Family Multi-Armed Bandits [88.21288104408556]
We study the regret of Thompson sampling (TS) algorithms for exponential family bandits, where the reward distribution is from a one-dimensional exponential family.
We propose a Thompson sampling algorithm, termed Expulli, which uses a novel sampling distribution to avoid under-estimation of the optimal arm.
arXiv Detail & Related papers (2022-06-07T18:08:21Z) - Scale Free Adversarial Multi Armed Bandits [13.757095663704858]
We consider the Scale-Free Adversarial Multi-Armed Bandit (MAB) problem.
We design a Follow The Regularized Leader (FTRL) algorithm, which comes with the first scale-free regret guarantee for MAB.
We also develop a new technique for obtaining local-norm lower bounds for Bregman Divergences.
arXiv Detail & Related papers (2021-06-08T21:26:57Z) - Recurrent Submodular Welfare and Matroid Blocking Bandits [22.65352007353614]
A recent line of research focuses on the study of the multi-armed bandit (MAB) problem.
We develop new algorithmic ideas that allow us to obtain a $(1 - \frac{1}{e})$-approximation for any matroid.
A key ingredient is the technique of correlated (interleaved) scheduling.
arXiv Detail & Related papers (2021-01-30T21:51:47Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
arXiv Detail & Related papers (2020-06-17T22:32:19Z) - Contextual Blocking Bandits [35.235375147227124]
We study a novel variant of the multi-armed bandit problem, where at each time step, the player observes an independently sampled context that determines the arms' mean rewards.
Playing an arm blocks it (across all contexts) for a fixed and known number of future time steps.
We propose a UCB-based variant of the full-information algorithm that guarantees an $\mathcal{O}(\log T)$-regret w.r.t. an $\alpha$-regret strategy in $T$ time steps, matching the $\Omega(\log T)$ lower bound.
arXiv Detail & Related papers (2020-03-06T20:34:42Z) - Robustness Guarantees for Mode Estimation with an Application to Bandits [131.21717367564963]
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.
arXiv Detail & Related papers (2020-03-05T21:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.