Information Capacity Regret Bounds for Bandits with Mediator Feedback
- URL: http://arxiv.org/abs/2402.10282v1
- Date: Thu, 15 Feb 2024 19:18:47 GMT
- Title: Information Capacity Regret Bounds for Bandits with Mediator Feedback
- Authors: Khaled Eldowa, Nicolò Cesa-Bianchi, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: We introduce the policy set capacity as an information-theoretic measure for the complexity of the policy set.
Adopting the classical EXP4 algorithm, we provide new regret bounds depending on the policy set capacity.
For a selection of policy set families, we prove nearly-matching lower bounds, scaling similarly with the capacity.
- Score: 55.269551124587224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses the mediator feedback problem, a bandit game where the
decision set consists of a number of policies, each associated with a
probability distribution over a common space of outcomes. Upon choosing a
policy, the learner observes an outcome sampled from its distribution and
incurs the loss assigned to this outcome in the present round. We introduce the
policy set capacity as an information-theoretic measure for the complexity of
the policy set. Adopting the classical EXP4 algorithm, we provide new regret
bounds depending on the policy set capacity in both the adversarial and the
stochastic settings. For a selection of policy set families, we prove
nearly-matching lower bounds, scaling similarly with the capacity. We also
consider the case when the policies' distributions can vary between rounds,
thus addressing the related bandits with expert advice problem, for which we
improve upon prior results. Additionally, we prove a lower bound showing
that exploiting the similarity between the policies is not possible in general
under linear bandit feedback. Finally, for a full-information variant, we
provide a regret bound scaling with the information radius of the policy set.
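For concreteness, here is a minimal sketch of how an EXP4-style exponential-weights update can be instantiated under mediator feedback: policies play the role of experts, the sampled outcome plays the role of the observed arm, and the loss is importance-weighted by the induced distribution over outcomes. The learning rate, loss estimator, and function names are illustrative assumptions, not the exact algorithm or tuning analyzed in the paper.

```python
import numpy as np

def exp4_mediator_feedback(policies, loss_fn, T, eta=0.1, seed=None):
    """EXP4-style exponential weights under mediator feedback (illustrative sketch).

    policies: (K, M) array; row i is policy i's distribution over M outcomes.
    loss_fn:  callable (round, outcome_index) -> loss in [0, 1].
    """
    rng = np.random.default_rng(seed)
    K, M = policies.shape
    weights = np.ones(K)
    total_loss = 0.0
    for t in range(T):
        probs = weights / weights.sum()        # distribution over policies
        outcome_dist = probs @ policies        # induced distribution over outcomes
        i = rng.choice(K, p=probs)             # choose a policy
        x = rng.choice(M, p=policies[i])       # mediator samples an outcome from it
        loss = loss_fn(t, x)
        total_loss += loss
        # Importance-weighted loss estimate for the observed outcome, then for each policy.
        est_outcome = np.zeros(M)
        est_outcome[x] = loss / outcome_dist[x]
        est_policy = policies @ est_outcome
        weights *= np.exp(-eta * est_policy)
        weights /= weights.max()               # renormalize for numerical stability
    return total_loss

# Usage: three policies over four outcomes; outcome 0 always incurs loss 1.
policies = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.7, 0.1, 0.1],
                     [0.25, 0.25, 0.25, 0.25]])
print(exp4_mediator_feedback(policies, lambda t, x: float(x == 0), T=1000, seed=0))
```

The importance weights are computed against the mixture over outcomes induced by all policies, which is what lets overlapping policies share information; the paper's capacity-dependent bounds quantify how much this sharing can help.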
Related papers
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution (PC) family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
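As a rough, hedged illustration of the convolution idea only (not the estimator proposed in that paper), a policy over a large action space can be smoothed with a similarity kernel built from latent action embeddings before importance weighting; the Gaussian kernel, bandwidth, and names below are assumptions.

```python
import numpy as np

def convolve_policy(policy, action_embeddings, bandwidth=1.0):
    """Smooth a policy over similar actions via a latent-space kernel (illustrative sketch).

    policy:            (A,) probability vector over actions.
    action_embeddings: (A, d) latent representation of each action.
    """
    # Gaussian similarity kernel over the latent action space, row-normalized.
    dists = np.linalg.norm(
        action_embeddings[:, None, :] - action_embeddings[None, :, :], axis=-1
    )
    kernel = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))
    kernel /= kernel.sum(axis=1, keepdims=True)
    # Spread each action's probability mass onto its neighbors in latent space.
    smoothed = kernel.T @ policy
    return smoothed / smoothed.sum()
```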
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Piecewise-Stationary Combinatorial Semi-Bandit with Causally Related Rewards [5.347237827669861]
We study the piecewise stationary semi-bandit problem with causally related rewards.
In our nonstationary environment, variations in the base arms' distributions, causal relationships between rewards, or both, change the reward generation process.
The problem becomes aggravated in the semi-bandit setting, where the decision-maker only observes the outcome of the selected bundle of arms.
arXiv Detail & Related papers (2023-07-26T12:06:13Z)
- Reproducible Bandits [95.8830340560603]
A policy in the bandit environment is called reproducible if it pulls, with high probability, the exact same sequence of arms in two different executions.
We show that not only do reproducible policies exist, but they also achieve almost the same optimal (non-reproducible) regret bounds in terms of the time horizon.
Our results show that even though randomization is crucial for the exploration-exploitation trade-off, an optimal balance can still be achieved while pulling the exact same arms in two different executions.
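A minimal toy check of the reproducibility definition (not the algorithms from that paper): fix the policy's internal seed, draw fresh reward noise in each execution, and compare the two arm sequences. The reward-oblivious round-robin policy below is trivially reproducible but has poor regret; the point of the paper is that near-optimal regret is also attainable.

```python
import numpy as np

def run_policy(choose_arm, arm_means, T, policy_seed, env_seed):
    """One execution: returns the sequence of arms pulled by the policy."""
    policy_rng = np.random.default_rng(policy_seed)  # internal randomness, shared across runs
    env_rng = np.random.default_rng(env_seed)        # fresh reward noise in each execution
    history, pulls = [], []
    for t in range(T):
        arm = choose_arm(t, history, policy_rng)
        reward = env_rng.normal(arm_means[arm], 1.0)
        history.append((arm, reward))
        pulls.append(arm)
    return pulls

def round_robin_policy(t, history, rng):
    """Reward-oblivious schedule: trivially reproducible, but suffers linear regret."""
    return t % 3

# Same policy seed, independent reward draws: the two executions pull identical arms.
means = [0.2, 0.5, 0.8]
run1 = run_policy(round_robin_policy, means, 100, policy_seed=7, env_seed=1)
run2 = run_policy(round_robin_policy, means, 100, policy_seed=7, env_seed=2)
print(run1 == run2)  # True
```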
arXiv Detail & Related papers (2022-10-04T20:36:45Z)
- Reward-Free Policy Space Compression for Reinforcement Learning [39.04317877999891]
In reinforcement learning, we encode the potential behaviors of an agent interacting with an environment into an infinite set of policies.
We seek a reward-free compression of the policy space into a finite set of representative policies.
We show that this compression of the policy space can be formulated as a set cover problem, and it is inherently NP-hard.
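Since the compression is cast as a set cover problem, the standard greedy approximation is the natural baseline. The sketch below picks representative policies until every policy in the original set is within a small total-variation distance of some representative; the distance, threshold, and names are illustrative assumptions rather than that paper's exact formulation.

```python
import numpy as np

def greedy_policy_cover(policies, tol=0.1):
    """Greedy set cover over a finite policy set (illustrative sketch):
    choose representatives so every policy lies within TV distance `tol` of one of them."""
    policies = np.asarray(policies)       # (K, M) outcome distributions, one row per policy
    tv = 0.5 * np.abs(policies[:, None, :] - policies[None, :, :]).sum(axis=-1)
    covers = tv <= tol                    # covers[i, j]: representative i covers policy j
    uncovered = set(range(len(policies)))
    representatives = []
    while uncovered:
        # Greedy step: pick the policy covering the most still-uncovered policies.
        best = max(range(len(policies)),
                   key=lambda i: len(uncovered & set(np.flatnonzero(covers[i]))))
        representatives.append(best)
        uncovered -= set(np.flatnonzero(covers[best]))
    return representatives
```

The greedy rule inherits the usual logarithmic approximation guarantee for set cover, which is essentially the best one can hope for given the NP-hardness noted above.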
arXiv Detail & Related papers (2022-02-22T18:11:57Z)
- Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take option correlation into account.
In CMCB, a learner sequentially chooses weight vectors over the given options and observes random feedback according to its decisions.
We propose novel algorithms with optimal regret (within logarithmic factors) and provide matching lower bounds to validate their optimality.
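As a hedged illustration of what a mean-covariance trade-off looks like for weight vectors over correlated options (the exact objective and feedback model in that paper may differ), the value of a weight vector can be taken as its expected payoff minus a covariance-induced risk penalty:

```python
import numpy as np

def mean_covariance_objective(w, mu, sigma, rho=1.0):
    """Risk-aware value of weight vector w over correlated options:
    expected payoff minus a covariance (risk) penalty. Illustrative sketch."""
    w = np.asarray(w, dtype=float)
    return float(w @ mu - rho * w @ sigma @ w)

# Two positively correlated options: diversifying the weights lowers the risk penalty.
mu = np.array([1.0, 0.9])
sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
print(mean_covariance_objective([1.0, 0.0], mu, sigma))  # all-in on option 0
print(mean_covariance_objective([0.5, 0.5], mu, sigma))  # diversified weights (higher value)
```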
arXiv Detail & Related papers (2021-02-24T06:37:05Z)
- Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over PI grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
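To make the last statement concrete, the snippet below computes the arithmetic and harmonic means of a set of per-slot divergences; the more heterogeneous the slots, the larger the gap, and hence (by the result above) the larger the claimed risk improvement over the PI estimator. The divergence values are made up purely for illustration.

```python
import numpy as np

def arithmetic_harmonic_gap(divergences):
    """Gap between arithmetic and harmonic means of per-slot divergences (always >= 0)."""
    d = np.asarray(divergences, dtype=float)
    arithmetic = d.mean()
    harmonic = len(d) / np.sum(1.0 / d)
    return arithmetic - harmonic

print(arithmetic_harmonic_gap([1.0, 1.0, 1.0]))  # identical slots -> gap 0
print(arithmetic_harmonic_gap([0.1, 1.0, 5.0]))  # heterogeneous slots -> gap ~1.77
```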
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
- Policy Optimization as Online Learning with Mediator Feedback [46.845765216238135]
Policy Optimization (PO) is a widely used approach to address continuous control tasks.
In this paper, we introduce the notion of mediator feedback that frames PO as an online learning problem over the policy space.
We propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RIST), for regret minimization.
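The name points to multiple importance sampling with truncated weights; below is a hedged sketch of that building block, using the standard balance heuristic for the mixture denominator and a hard truncation threshold. The function name, threshold, and estimator form are assumptions, not the algorithm as specified in that paper.

```python
import numpy as np

def truncated_mis_estimate(samples, losses, behavior_densities, target_density, threshold=10.0):
    """Multiple importance sampling with truncation, balance heuristic (illustrative sketch).

    samples:            outcomes pooled from several behavior policies.
    losses:             loss of each sampled outcome.
    behavior_densities: list of callables; density of each behavior policy at an outcome.
    target_density:     callable; density of the policy being evaluated.
    """
    estimates = []
    for x, loss in zip(samples, losses):
        mixture = np.mean([q(x) for q in behavior_densities])  # balance-heuristic denominator
        weight = min(target_density(x) / mixture, threshold)   # truncate large weights
        estimates.append(weight * loss)
    return float(np.mean(estimates))

# Usage with two discrete behavior policies over two outcomes.
behavior = [lambda x: [0.7, 0.3][x], lambda x: [0.2, 0.8][x]]
target = lambda x: [0.5, 0.5][x]
print(truncated_mis_estimate([0, 1, 1, 0], [1.0, 0.0, 0.0, 1.0], behavior, target))
```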
arXiv Detail & Related papers (2020-12-15T11:34:29Z)
- Variational Policy Propagation for Multi-agent Reinforcement Learning [68.26579560607597]
We propose a collaborative multi-agent reinforcement learning algorithm named variational policy propagation (VPP) to learn a joint policy through interactions among agents.
We prove that the joint policy is a Markov Random Field under some mild conditions, which in turn reduces the policy space effectively.
We integrate variational inference as special differentiable layers in the policy, such that actions can be efficiently sampled from the Markov Random Field and the overall policy is differentiable.
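To illustrate what sampling a joint action from a Markov-Random-Field-structured policy can look like, the sketch below runs Gibbs sampling over a chain of agents with pairwise potentials. The chain structure, potentials, and names are assumptions made for illustration, not the VPP architecture itself.

```python
import numpy as np

def gibbs_sample_joint_action(unary, pairwise, sweeps=50, seed=None):
    """Sample a joint action from a chain-structured MRF policy (illustrative sketch).

    unary:    (N, A) per-agent action scores.
    pairwise: (N-1, A, A) coupling scores between neighboring agents.
    """
    rng = np.random.default_rng(seed)
    N, A = unary.shape
    actions = rng.integers(A, size=N)
    for _ in range(sweeps):
        for i in range(N):
            logits = unary[i].copy()
            if i > 0:
                logits += pairwise[i - 1][actions[i - 1], :]  # coupling with left neighbor
            if i < N - 1:
                logits += pairwise[i][:, actions[i + 1]]      # coupling with right neighbor
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            actions[i] = rng.choice(A, p=probs)               # resample agent i's action
    return actions

# Three agents, two actions each; the coupling rewards neighboring agents for agreeing.
unary = np.zeros((3, 2))
pairwise = np.tile(np.array([[1.0, -1.0],
                             [-1.0, 1.0]]), (2, 1, 1))
print(gibbs_sample_joint_action(unary, pairwise, seed=0))
```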
arXiv Detail & Related papers (2020-04-19T15:42:55Z)