Worst-case Performance of Greedy Policies in Bandits with Imperfect
Context Observations
- URL: http://arxiv.org/abs/2204.04773v1
- Date: Sun, 10 Apr 2022 21:27:56 GMT
- Title: Worst-case Performance of Greedy Policies in Bandits with Imperfect
Context Observations
- Authors: Hongju Park and Mohamad Kazem Shirani Faradonbeh
- Abstract summary: This work considers Greedy reinforcement learning policies that take actions as if the current estimates of the parameter and of the unobserved contexts coincide with the corresponding true values.
We establish that the non-asymptotic worst-case regret grows logarithmically with the time horizon and the failure probability, while it scales linearly with the number of arms.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual bandits are canonical models for sequential decision-making under
uncertainty in environments with time-varying components. In this setting, the
expected reward of each bandit arm consists of the inner product of an unknown
parameter and the context vector of that arm, perturbed with a random error.
The classical setting heavily relies on fully observed contexts, while the
study of the richer model of imperfectly observed contextual bandits is less mature. This
work considers Greedy reinforcement learning policies that take actions as if
the current estimates of the parameter and of the unobserved contexts coincide
with the corresponding true values. We establish that the non-asymptotic
worst-case regret grows logarithmically with the time horizon and the failure
probability, while it scales linearly with the number of arms. Numerical
analysis showcasing this efficiency of Greedy policies is also provided.
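
To make the Greedy plug-in idea concrete, the following is a minimal simulation sketch, not the paper's exact construction: contexts are drawn each round, observed through additive Gaussian noise, and the agent pulls the arm maximizing the inner product of its current context estimate and its current parameter estimate. The Gaussian observation model, the standard-normal context prior behind the posterior-mean context estimate, and the ridge-style least-squares update are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_arms, horizon = 5, 10, 2000
theta = rng.normal(size=d)           # unknown reward parameter (known only to the simulator)
sigma_ctx, sigma_rwd = 0.5, 0.1      # context-observation and reward noise levels (assumed)

A = np.eye(d)                        # regularized Gram matrix for the least-squares estimate
b = np.zeros(d)
cumulative_regret = 0.0

for t in range(horizon):
    contexts = rng.normal(size=(n_arms, d))                          # true, unobserved contexts
    observed = contexts + sigma_ctx * rng.normal(size=(n_arms, d))   # noisy context observations

    # Greedy plug-in step: act as if the current estimates were the true values.
    # Context estimate: posterior mean under a standard-normal context prior
    # and Gaussian observation noise (an assumed observation model).
    ctx_hat = observed / (1.0 + sigma_ctx ** 2)
    theta_hat = np.linalg.solve(A, b)                                # current parameter estimate
    arm = int(np.argmax(ctx_hat @ theta_hat))

    reward = contexts[arm] @ theta + sigma_rwd * rng.normal()
    cumulative_regret += np.max(contexts @ theta) - contexts[arm] @ theta

    # Update the least-squares statistics with the estimated context of the pulled arm.
    A += np.outer(ctx_hat[arm], ctx_hat[arm])
    b += reward * ctx_hat[arm]

print(f"cumulative regret after {horizon} rounds: {cumulative_regret:.2f}")
```

In such a sketch the per-round regret typically shrinks as the parameter estimate concentrates, which is the qualitative behavior that the paper's logarithmic worst-case regret bound formalizes.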
Related papers
- Bridging Rested and Restless Bandits with Graph-Triggering: Rising and Rotting (2024-09-09)
Graph-Triggered Bandits is a framework to generalize rested and restless bandits.
In this work, we focus on two specific types of monotonic bandits: rising, where the expected reward of an arm grows as the number of triggers increases, and rotting, where the opposite behavior occurs.
- Non-stochastic Bandits With Evolving Observations (2024-05-27)
We introduce a novel online learning framework that unifies and generalizes pre-established models.
We propose regret minimization algorithms for both the full-information and bandit settings.
Our algorithms match the known regret bounds across many special cases, while also introducing previously unknown bounds.
- Thompson Sampling in Partially Observable Contextual Bandits (2024-02-15)
We study bandit policies for learning to select optimal arms based on the data of observations.
Our theoretical analysis shows that the Thompson sampling policy successfully balances exploration and exploitation.
These techniques pave the road towards studying other decision-making problems with contextual information as well as partial observations.
- Online learning in bandits with predicted context (2023-07-26)
We consider the contextual bandit problem where at each time, the agent only has access to a noisy version of the context.
This setting is motivated by a wide range of applications where the true context for decision-making is unobserved.
We propose the first online algorithm in this setting with sublinear regret guarantees under mild conditions.
- Reproducible Bandits (2022-10-04)
A policy in the bandit environment is called reproducible if it pulls, with high probability, the exact same sequence of arms in two different executions.
We show that not only do reproducible policies exist, but also that they achieve almost the same optimal (non-reproducible) regret bounds in terms of the time horizon.
Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different rounds of executions.
- Analysis of Thompson Sampling for Partially Observable Contextual Multi-Armed Bandits (2021-10-23)
We propose a Thompson Sampling algorithm for partially observable contextual multi-armed bandits.
We show that the regret of the presented policy scales logarithmically with time and the number of arms, and linearly with the dimension.
- Online Model Selection: a Rested Bandit Formulation (2020-12-07)
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
- Greedy Bandits with Sampled Context (2020-07-27)
Greedy Bandits with Sampled Context (GB-SC) is a method for contextual multi-armed bandits that develops the prior from context information.
Our results show competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret.
- Offline Contextual Bandits with Overparameterized Models (2020-06-27)
We ask whether the strong generalization of overparameterized models observed in supervised learning also occurs for offline contextual bandits.
We show that this discrepancy is due to the action-stability of their objectives.
In experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
- Robustness Guarantees for Mode Estimation with an Application to Bandits (2020-03-05)
We introduce a theory for multi-armed bandits where the values are the modes of the reward distributions instead of the mean.
We show in simulations that our algorithms are robust to perturbation of the arms by adversarial noise sequences.