Related papers: Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback

Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback

URL: http://arxiv.org/abs/2404.04509v2
Date: Fri, 16 Aug 2024 19:47:32 GMT
Title: Distributed No-Regret Learning for Multi-Stage Systems with End-to-End Bandit Feedback
Authors: I-Hong Hou,
Abstract summary: This paper studies multi-stage systems with end-to-end bandit feedback. Each job needs to go through multiple stages, each managed by a different agent, before generating an outcome. The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments.
Score: 7.8539454948826375
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper studies multi-stage systems with end-to-end bandit feedback. In such systems, each job needs to go through multiple stages, each managed by a different agent, before generating an outcome. Each agent can only control its own action and learn the final outcome of the job. It has neither knowledge nor control on actions taken by agents in the next stage. The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments. The setting of this paper significantly expands the traditional multi-armed bandit problem, which considers only one agent and one stage. In addition to the exploration-exploitation dilemma in the traditional multi-armed bandit problem, we show that the consideration of multiple stages introduces a third component, education, where an agent needs to choose its actions to facilitate the learning of agents in the next stage. To solve this newly introduced exploration-exploitation-education trilemma, we propose a simple distributed online learning algorithm, $\epsilon-$EXP3. We theoretically prove that the $\epsilon-$EXP3 algorithm is a no-regret policy that achieves sublinear regret. Simulation results show that the $\epsilon-$EXP3 algorithm significantly outperforms existing no-regret online learning algorithms for the traditional multi-armed bandit problem.

Related papers

Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents [16.514561132180134]
We study the repeated principal-agent bandit game, where the principal indirectly interacts with the unknown environment by proposing incentives for the agent to play arms. Most existing work assumes the agent has full knowledge of the reward means and always behaves greedily, but in many online marketplaces, the agent needs to learn the unknown environment and sometimes explore. Motivated by such settings, we model a self-interested learning agent with exploration behaviors who iteratively updates reward estimates and either selects an arm that maximizes the estimated reward plus incentive or explores arbitrarily with a certain probability.
arXiv Detail & Related papers (2024-12-20T20:04:50Z)
Incentivized Learning in Principal-Agent Bandit Games [62.41639598376539]
This work considers a repeated principal-agent bandit game, where the principal can only interact with her environment through the agent. The principal can influence the agent's decisions by offering incentives which add up to his rewards. We present nearly optimal learning algorithms for the principal's regret in both multi-armed and linear contextual settings.
arXiv Detail & Related papers (2024-03-06T16:00:46Z)
Impact of Decentralized Learning on Player Utilities in Stackelberg Games [57.08270857260131]
In many two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks result in worst-case linear regret for at least one player. We develop algorithms to achieve near-optimal $O(T2/3)$ regret for both players with respect to these benchmarks.
arXiv Detail & Related papers (2024-02-29T23:38:28Z)
Multi-Agent Bandit Learning through Heterogeneous Action Erasure Channels [21.860440468189044]
Multi-Armed Bandit (MAB) systems are witnessing an upswing in applications within multi-agent distributed environments. In such settings, communication between agents executing actions and the primary learner making decisions can hinder the learning process. We introduce novel algorithms that enable learners to interact concurrently with distributed agents across heterogeneous action erasure channels.
arXiv Detail & Related papers (2023-12-21T19:21:19Z)
Learning to Incentivize Information Acquisition: Proper Scoring Rules Meet Principal-Agent Model [64.94131130042275]
We study the incentivized information acquisition problem, where a principal hires an agent to gather information on her behalf. We design a provably sample efficient algorithm that tailors the UCB algorithm to our model. Our algorithm features a delicate estimation procedure for the optimal profit of the principal, and a conservative correction scheme that ensures the desired agent's actions are incentivized.
arXiv Detail & Related papers (2023-03-15T13:40:16Z)
Bandit Social Learning: Exploration under Myopic Behavior [58.75758600464338]
We study social learning dynamics motivated by reviews on online platforms. Agents collectively follow a simple multi-armed bandit protocol, but each agent acts myopically, without regards to exploration. We derive stark learning failures for any such behavior, and provide matching positive results.
arXiv Detail & Related papers (2023-02-15T01:57:57Z)
Multi-Agent Neural Rewriter for Vehicle Routing with Limited Disclosure of Costs [65.23158435596518]
Solving the multi-vehicle routing problem as a team Markov game with partially observable costs. Our multi-agent reinforcement learning approach, the so-called multi-agent Neural Rewriter, builds on the single-agent Neural Rewriter to solve the problem by iteratively rewriting solutions.
arXiv Detail & Related papers (2022-06-13T09:17:40Z)
SA-MATD3:Self-attention-based multi-agent continuous control method in cooperative environments [12.959163198988536]
Existing algorithms suffer from the problem of uneven learning degree with the increase of the number of agents. A new structure for a multi-agent actor critic is proposed, and the self-attention mechanism is applied in the critic network. The proposed algorithm makes full use of the samples in the replay memory buffer to learn the behavior of a class of agents.
arXiv Detail & Related papers (2021-07-01T08:15:05Z)
Decentralized Multi-Agent Linear Bandits with Safety Constraints [31.67685495996986]
We study decentralized linear bandits, where a network of $N$ agents acts cooperatively to solve a linear bandit-optimization problem. We propose DLUCB: a fully decentralized algorithm that minimizes the cumulative regret over the entire network. We show that our ideas extend naturally to the emerging, albeit more challenging, setting of safe bandits.
arXiv Detail & Related papers (2020-12-01T07:33:00Z)
Latent Bandits Revisited [55.88616813182679]
A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state. We propose general algorithms for this setting, based on both upper confidence bounds (UCBs) and Thompson sampling. We provide a unified theoretical analysis of our algorithms, which have lower regret than classic bandit policies when the number of latent states is smaller than actions.
arXiv Detail & Related papers (2020-06-15T19:24:02Z)
Optimally Deceiving a Learning Leader in Stackelberg Games [123.14187606686006]
Recent results in the ML community have revealed that learning algorithms used to compute the optimal strategy for the leader to commit to in a Stackelberg game, are susceptible to manipulation by the follower. This paper shows that it is always possible for the follower to compute (near-optimal) payoffs for various scenarios about the learning interaction between leader and follower.
arXiv Detail & Related papers (2020-06-11T16:18:21Z)
Regret Bounds for Decentralized Learning in Cooperative Multi-Agent Dynamical Systems [3.9599054392856488]
quadratic analysis is challenging in Multi-Agent Reinforcement Learning (MARL) We propose a MARL algorithm based on the construction of an auxiliary single-agent LQ problem. We show that our algorithm provides a $tildeO(sqrtT)$ regret bound.
arXiv Detail & Related papers (2020-01-27T23:37:41Z)
Algorithms in Multi-Agent Systems: A Holistic Perspective from Reinforcement Learning and Game Theory [2.5147566619221515]
Deep reinforcement learning has achieved outstanding results in recent years. Recent works are exploring learning beyond single-agent scenarios and considering multi-agent scenarios. Traditional game-theoretic algorithms, which, in turn, show bright application promise combined with modern algorithms and boosting computing power.
arXiv Detail & Related papers (2020-01-17T15:08:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.