Pure Exploration under Mediators' Feedback
- URL: http://arxiv.org/abs/2308.15552v2
- Date: Fri, 12 Jan 2024 12:28:23 GMT
- Title: Pure Exploration under Mediators' Feedback
- Authors: Riccardo Poiani, Alberto Maria Metelli, Marcello Restelli
- Abstract summary: Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a stochastic and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
- Score: 63.56002444692792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic multi-armed bandits are a sequential-decision-making framework,
where, at each interaction step, the learner selects an arm and observes a
stochastic reward. Within the context of best-arm identification (BAI)
problems, the goal of the agent lies in finding the optimal arm, i.e., the one
with the highest expected reward, as accurately and efficiently as possible.
Nevertheless, the sequential interaction protocol of classical BAI problems,
where the agent has complete control over the arm being pulled at each round,
does not effectively model several decision-making problems of interest (e.g.,
off-policy learning, partially controllable environments, and human feedback).
For this reason, in this work, we propose a novel strict generalization of the
classical BAI problem that we refer to as best-arm identification under
mediators' feedback (BAI-MF). More specifically, we consider the scenario in
which the learner has access to a set of mediators, each of which selects the
arms on the agent's behalf according to a stochastic and possibly unknown
policy. The mediator, then, communicates back to the agent the pulled arm
together with the observed reward. In this setting, the agent's goal lies in
sequentially choosing which mediator to query to identify with high probability
the optimal arm while minimizing the identification time, i.e., the sample
complexity. To this end, we first derive and analyze a statistical lower bound
on the sample complexity specific to our general mediator feedback scenario.
Then, we propose a sequential decision-making strategy for discovering the best
arm under the assumption that the mediators' policies are known to the learner.
As our theory verifies, this algorithm matches the lower bound both almost
surely and in expectation. Finally, we extend these results to cases where the
mediators' policies are unknown to the learner, obtaining comparable results.
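Below is a minimal simulation sketch of the BAI-MF interaction protocol described in the abstract, assuming Bernoulli rewards, known mediator policies, and a placeholder round-robin mediator-selection rule; all names and numbers are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative BAI-MF instance: K arms with Bernoulli rewards and M mediators,
# where mediator m pulls arms according to a fixed stochastic policy pi[m].
mu = np.array([0.3, 0.5, 0.7])            # unknown mean rewards of the arms
pi = np.array([[0.8, 0.1, 0.1],           # mediator 0 mostly pulls arm 0
               [0.1, 0.1, 0.8]])          # mediator 1 mostly pulls arm 2
K, M = mu.size, pi.shape[0]

counts = np.zeros(K)                      # observed pulls per arm
sums = np.zeros(K)                        # observed reward sums per arm

for t in range(5000):
    m = t % M                             # placeholder round-robin choice of mediator;
                                          # the paper's strategy chooses m adaptively
    a = rng.choice(K, p=pi[m])            # mediator samples an arm from its policy
    r = rng.binomial(1, mu[a])            # environment returns a stochastic reward
    counts[a] += 1                        # the learner observes the pulled arm
    sums[a] += r                          # ... together with its reward

means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
print("empirical best arm:", int(means.argmax()))   # recommended arm
```

A faithful implementation would replace the round-robin rule with an adaptive mediator allocation matched to the sample-complexity lower bound and add a principled stopping rule, which is what the paper develops.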
Related papers
- Stochastic Bandits for Egalitarian Assignment [58.33714486693828]
We study EgalMAB, an egalitarian assignment problem in the context of multi-armed bandits.
We design and analyze a UCB-based policy EgalUCB and establish upper bounds on the cumulative regret.
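For context, a generic statement of the index such UCB-based policies build on (standard UCB1, not necessarily the exact EgalUCB index): at round $t$, select the option maximizing $\mathrm{UCB}_i(t) = \hat{\mu}_i(t) + \sqrt{2\ln t / N_i(t)}$, where $\hat{\mu}_i(t)$ is the empirical mean reward of option $i$ and $N_i(t)$ its selection count so far.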
arXiv Detail & Related papers (2024-10-08T09:49:47Z)
- Optimal Regret Bounds for Collaborative Learning in Bandits [10.76667043339504]
We consider regret in a general collaborative multi-agent multi-armed bandit model.
We propose the first algorithm with order-optimal regret bounds under this model.
arXiv Detail & Related papers (2023-12-15T10:36:13Z)
- Online Decision Mediation [72.80902932543474]
Consider learning a decision support assistant to serve as an intermediary between (oracle) expert behavior and (imperfect) human behavior.
In clinical diagnosis, fully-autonomous machine behavior is often beyond ethical affordances.
arXiv Detail & Related papers (2023-10-28T05:59:43Z)
- On the Complexity of Multi-Agent Decision Making: From Learning in Games to Partial Monitoring [105.13668993076801]
A central problem in the theory of multi-agent reinforcement learning (MARL) is to understand what structural conditions and algorithmic principles lead to sample-efficient learning guarantees.
We study this question in a general framework for interactive decision making with multiple agents.
We show that characterizing the statistical complexity for multi-agent decision making is equivalent to characterizing the statistical complexity of single-agent decision making.
arXiv Detail & Related papers (2023-05-01T06:46:22Z)
- Near-Optimal Collaborative Learning in Bandits [15.456561090871244]
This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms.
The twist is that the optimal arm for each agent is the arm with the largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents.
We propose a near-optimal algorithm for pure exploration.
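In illustrative notation (not necessarily the paper's own), with $M$ agents and weights $w_{ij}$, the mixed reward of arm $a$ for agent $i$ is $\mu_i^{\mathrm{mix}}(a) = \sum_{j=1}^{M} w_{ij}\,\mu_j(a)$, where $\mu_j(a)$ is the expected reward of arm $a$ for agent $j$; agent $i$'s optimal arm is then $\arg\max_a \mu_i^{\mathrm{mix}}(a)$.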
arXiv Detail & Related papers (2022-05-31T21:11:47Z)
- Best Arm Identification under Additive Transfer Bandits [49.69203462561861]
We study the setting in which, despite the means being unknown, there is a known additive relationship between the source and target MAB instances.
We propose and theoretically analyze an LUCB-style algorithm to identify an $\epsilon$-optimal target arm with high probability.
arXiv Detail & Related papers (2021-12-08T02:20:18Z)
- Exploiting Heterogeneity in Robust Federated Best-Arm Identification [19.777265059976337]
Fed-SEL is a simple communication-efficient algorithm that builds on successive elimination techniques and involves local sampling steps at the clients.
We show that for certain heterogeneous problem instances, Fed-SEL outputs the best arm after just one round of communication.
As our final contribution, we develop variants of Fed-SEL, both for federated and peer-to-peer settings, that are robust to the presence of Byzantine clients.
arXiv Detail & Related papers (2021-09-13T04:22:21Z)
- Peer Selection with Noisy Assessments [43.307040330622186]
We extend PeerNomination, the most accurate peer reviewing algorithm to date, into WeightedPeerNomination.
We show analytically that a weighting scheme can improve the overall accuracy of the selection significantly.
arXiv Detail & Related papers (2021-07-21T14:47:11Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model to take into account option correlation.
In CMCB, there is a learner who sequentially chooses weight vectors on given options and observes random feedback according to the decisions.
We propose novel algorithms with optimal regrets (within logarithmic factors) and provide matching lower bounds to validate their optimality.
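One plausible reading of this feedback model, under assumed notation: at each round the learner picks a weight vector $w_t$ over the options, and the quality of a weight vector trades expected reward against risk, e.g., $f(w) = w^{\top}\theta - \rho\, w^{\top}\Sigma\, w$ with unknown mean vector $\theta$, covariance matrix $\Sigma$, and risk-aversion parameter $\rho \ge 0$; this mean-variance form is an illustrative assumption, not a claim about the paper's exact objective.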
arXiv Detail & Related papers (2021-02-24T06:37:05Z)