The Impact of Batch Learning in Stochastic Linear Bandits
- URL: http://arxiv.org/abs/2202.06657v1
- Date: Mon, 14 Feb 2022 12:27:06 GMT
- Title: The Impact of Batch Learning in Stochastic Linear Bandits
- Authors: Danil Provodin, Pratik Gajane, Mykola Pechenizkiy, Maurits Kaptein
- Abstract summary: We consider a special case of bandit problems, named batched bandits, in which an agent observes batches of responses over a certain time period.
Our main theoretical results show that the impact of batch learning can be measured in proportion to the regret of online behavior.
- Score: 7.3449418475577595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider a special case of bandit problems, named batched bandits, in
which an agent observes batches of responses over a certain time period. Unlike
previous work, we consider a practically relevant batch-centric scenario of
batch learning. That is to say, we provide a policy-agnostic regret analysis
and demonstrate upper and lower bounds for the regret of a candidate policy.
Our main theoretical results show that the impact of batch learning can be
measured in proportion to the regret of online behavior. Primarily, we study two
settings of the problem: instance-independent and instance-dependent. While the
upper bound is the same for both settings, the worst-case lower bound is more
comprehensive in the former case and more accurate in the latter one. Also, we
provide a more robust result for the 2-armed bandit problem as an important
insight. Finally, we demonstrate the consistency of theoretical results by
conducting empirical experiments and reflect on the optimal batch size choice.
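To make the batch-centric scenario concrete, below is a minimal simulation sketch (our illustration, not the authors' code) of a LinUCB-style agent in a stochastic linear bandit whose responses arrive, and whose estimates are refreshed, only at batch boundaries. The dimension, horizon, batch sizes, noise level, and confidence width are arbitrary assumptions.

```python
# Illustrative simulation of batch learning in a stochastic linear bandit.
# All parameters and the LinUCB variant are assumptions for demonstration;
# this is not the paper's implementation.
import numpy as np

def run_lin_ucb(T, batch_size, d=5, n_arms=10, noise=0.1,
                lam=1.0, alpha=1.0, seed=0):
    """LinUCB whose estimates are refreshed only at batch boundaries."""
    rng = np.random.default_rng(seed)  # same instance for every batch size
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)
    arms = rng.normal(size=(n_arms, d))
    arms /= np.linalg.norm(arms, axis=1, keepdims=True)

    A = lam * np.eye(d)              # ridge-regularized Gram matrix
    b = np.zeros(d)                  # running sum of reward-weighted contexts
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    pending = []                     # responses withheld until the batch ends
    regret = 0.0
    opt = (arms @ theta_star).max()  # best achievable expected reward

    for t in range(1, T + 1):
        # UCB scores use the estimate from the last completed batch only.
        widths = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + alpha * widths))]
        r = x @ theta_star + noise * rng.normal()
        regret += opt - x @ theta_star
        pending.append((x, r))
        if t % batch_size == 0:      # batch ends: observe responses, update
            for x_i, r_i in pending:
                A += np.outer(x_i, x_i)
                b += r_i * x_i
            pending.clear()
            A_inv = np.linalg.inv(A)
            theta_hat = A_inv @ b
    return regret

for bs in (1, 10, 100):
    print(f"batch size {bs:>3}: regret {run_lin_ucb(T=2000, batch_size=bs):.1f}")
```

Running the sketch for several batch sizes gives an informal feel for the result: larger batches force more decisions to be made with stale estimates, and the extra regret grows with batch size, consistent with the paper's claim that the impact of batch learning scales with the regret of online behavior.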
Related papers
- Causal Contextual Bandits with Adaptive Context [12.205797997133397]
We study a variant of causal contextual bandits where the context is chosen based on an initial intervention chosen by the learner.
We prove that our simple regret is essentially tight for a large class of instances.
arXiv Detail & Related papers (2024-05-28T22:17:57Z) - Thompson Sampling in Partially Observable Contextual Bandits [2.465689259704613]
We study bandit policies for learning to select optimal arms based on observed data.
Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation; a minimal Thompson sampling sketch appears after this list.
These techniques pave the road towards studying other decision-making problems with contextual information as well as partial observations.
arXiv Detail & Related papers (2024-02-15T19:37:39Z) - On Penalization in Stochastic Multi-armed Bandits [22.04356596828437]
We study an important variant of the multi-armed bandit (MAB) problem, which takes penalization into consideration.
We propose a hard-threshold UCB-like algorithm, which enjoys many merits, including fairness, nearly optimal regret, and a better tradeoff between reward and fairness.
arXiv Detail & Related papers (2022-11-15T17:13:09Z) - Batched Dueling Bandits [13.69077222007053]
We study the batched $K$-armed dueling bandit problem under two standard settings.
We obtain algorithms with a smooth trade-off between the number of batches and regret.
arXiv Detail & Related papers (2022-02-22T04:02:36Z) - The Impact of Batch Learning in Stochastic Bandits [5.008064542274928]
We consider a special case of bandit problems, namely batched bandits.
Motivated by natural restrictions of recommender systems and e-commerce platforms, we assume that a learning agent observes responses batched in groups over a certain time period.
We provide a policy-agnostic regret analysis and demonstrate upper and lower bounds for the regret of a candidate policy.
arXiv Detail & Related papers (2021-11-03T08:38:10Z) - Adversarial Robustness with Semi-Infinite Constrained Learning [177.42714838799924]
The vulnerability of deep learning to input perturbations has raised serious questions about its use in safety-critical domains.
We propose a hybrid Langevin Monte Carlo training approach to mitigate this issue.
We show that our approach can mitigate the trade-off between state-of-the-art performance and robustness.
arXiv Detail & Related papers (2021-10-29T13:30:42Z) - Batched Neural Bandits [107.5072688105936]
BatchNeuralUCB combines neural networks with optimism to address the exploration-exploitation tradeoff.
We prove that BatchNeuralUCB achieves the same regret as the fully sequential version while reducing the number of policy updates considerably.
arXiv Detail & Related papers (2021-02-25T17:36:44Z) - Instance-Dependent Complexity of Contextual Bandits and Reinforcement Learning: A Disagreement-Based Perspective [104.67295710363679]
In the classical multi-armed bandit problem, instance-dependent algorithms attain improved performance on "easy" problems with a gap between the best and second-best arm.
We introduce a family of complexity measures that are both sufficient and necessary to obtain instance-dependent regret bounds.
We then introduce new oracle-efficient algorithms which adapt to the gap whenever possible, while also attaining the minimax rate in the worst case.
arXiv Detail & Related papers (2020-10-07T01:33:06Z) - Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders [62.54431888432302]
We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, taking a more conservative update, can yield much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z) - Predictive Bandits [68.8204255655161]
We introduce and study a new class of bandit problems, referred to as predictive bandits.
In each round, the decision maker first decides whether to gather information about the rewards of particular arms.
The decision maker then selects an arm to be actually played in the round.
arXiv Detail & Related papers (2020-04-02T17:12:33Z)
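As referenced in the Thompson sampling entry above, the following is a minimal Beta-Bernoulli Thompson sampling sketch. It is our illustration for the basic multi-armed setting, not the cited paper's partially observable contextual setting; the arm means, horizon, and seed are assumptions.

```python
# Minimal Beta-Bernoulli Thompson sampling (illustrative assumptions only):
# sampling from the posterior naturally balances exploration and exploitation.
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.3, 0.5, 0.7])  # assumed arm reward probabilities
alpha = np.ones(len(true_means))        # Beta posterior: 1 + successes
beta = np.ones(len(true_means))         # Beta posterior: 1 + failures

regret = 0.0
for t in range(5000):
    samples = rng.beta(alpha, beta)     # one draw per arm from its posterior
    arm = int(np.argmax(samples))       # play the arm that looks best now
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward                # update the played arm's posterior
    beta[arm] += 1 - reward
    regret += true_means.max() - true_means[arm]

print(f"cumulative regret after 5000 rounds: {regret:.1f}")
```

Because under-explored arms retain wide posteriors, they are occasionally sampled above the empirical leader, which is the exploration-exploitation balance the entry refers to.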
This list is automatically generated from the titles and abstracts of the papers on this site.