Sequential Principal-Agent Problems with Communication: Efficient
Computation and Learning
- URL: http://arxiv.org/abs/2306.03832v2
- Date: Sun, 17 Dec 2023 13:34:46 GMT
- Title: Sequential Principal-Agent Problems with Communication: Efficient
Computation and Learning
- Authors: Jiarui Gan, Rupak Majumdar, Debmalya Mandal, Goran Radanovic
- Abstract summary: We study a sequential decision making problem between a principal and an agent with incomplete information on both sides.
In this model, the principal and the agent interact in a stochastic environment, and each is privy to observations about the state not available to the other.
We present a polynomial-time algorithm to compute the principal's optimal policy up to an additive approximation.
- Score: 27.50523143132825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study a sequential decision making problem between a principal and an
agent with incomplete information on both sides. In this model, the principal
and the agent interact in a stochastic environment, and each is privy to
observations about the state not available to the other. The principal has the
power of commitment, both to elicit information from the agent and to provide
signals about her own information. The principal and the agent communicate
their signals to each other, and select their actions independently based on
this communication. Each player receives a payoff based on the state and their
joint actions, and the environment moves to a new state. The interaction
continues over a finite time horizon, and both players act to optimize their
own total payoffs over the horizon. Our model encompasses as special cases
stochastic games of incomplete information and POMDPs, as well as sequential
Bayesian persuasion and mechanism design problems. We study both computation of
optimal policies and learning in our setting. While the general problems are
computationally intractable, we study algorithmic solutions under a conditional
independence assumption on the underlying state-observation distributions. We
present a polynomial-time algorithm to compute the principal's optimal policy
up to an additive approximation. Additionally, we show an efficient learning
algorithm in the case where the transition probabilities are not known
beforehand. The algorithm guarantees sublinear regret for both players.
Related papers
- Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees [91.88803125231189]
Multi-step Preference Optimization (MPO) is built upon the natural actor-critic framework [rakhlin2013online; joulani17a].
We show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium.
We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.
arXiv Detail & Related papers (2025-02-18T09:33:48Z) - Refined Sample Complexity for Markov Games with Independent Linear Function Approximation [49.5660193419984]
Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL).
This paper first refines the AVLPR framework of Wang et al. (2023), with the insight of designing a pessimistic estimate of the sub-optimality gap.
We give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\mathrm{poly}(A_{\max})$ dependency simultaneously.
arXiv Detail & Related papers (2024-02-11T01:51:15Z) - Decentralized Multi-Task Online Convex Optimization Under Random Link
Failures [5.513958040574729]
We develop a robust decentralized saddle-point algorithm against random link failures with heterogeneous probabilities.
We extend our algorithm and analysis to the two-point bandit feedback scenario.
arXiv Detail & Related papers (2024-01-04T00:57:33Z) - Task-Guided IRL in POMDPs that Scales [22.594913269327353]
In inverse reinforcement learning (IRL), a learning agent infers a reward function encoding the underlying task using demonstrations from experts.
Most IRL techniques require solving the computationally expensive forward problem -- computing an optimal policy given a reward function -- in POMDPs.
We develop an algorithm that reduces the information while increasing the data efficiency.
arXiv Detail & Related papers (2022-12-30T21:08:57Z) - Independent Policy Gradient for Large-Scale Markov Potential Games:
Sharper Rates, Function Approximation, and Game-Agnostic Convergence [30.084357461497042]
We learn a Nash equilibrium of an MPG in which the size of the state space and/or the number of players can be very large.
We propose new independent policy gradient algorithms that are run by all players in tandem.
We identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with players that are oblivious to the type of game being played.
arXiv Detail & Related papers (2022-02-08T20:09:47Z) - Cooperative Online Learning in Stochastic and Adversarial MDPs [50.62439652257712]
We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs).
In each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret.
We are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or in adversarial MDPs.
arXiv Detail & Related papers (2022-01-31T12:32:11Z) - Provably Efficient Reinforcement Learning in Decentralized General-Sum
Markov Games [5.205867750232226]
This paper addresses the problem of learning an equilibrium efficiently in general-sum Markov games.
We propose an algorithm in which each agent independently runs optimistic V-learning to efficiently explore the unknown environment.
We show that the agents can find an $\epsilon$-approximate CCE in at most $\widetilde{O}(H^6 S A / \epsilon^2)$ episodes.
arXiv Detail & Related papers (2021-10-12T02:01:22Z) - Reinforcement Learning in Reward-Mixing MDPs [74.41782017817808]
We consider episodic reinforcement learning in a reward-mixing Markov decision process (MDP).
An $\epsilon$-optimal policy can be learned after exploring $\tilde{O}(\mathrm{poly}(H, \epsilon^{-1}) \cdot S^2 A^2)$ episodes, where $H$ is the time horizon and $S, A$ are the number of states and actions, respectively.
arXiv Detail & Related papers (2021-10-07T18:55:49Z) - Towards General Function Approximation in Zero-Sum Markov Games [126.58493169301012]
This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves.
Provably efficient algorithms for both decoupled and coordinated settings are developed.
arXiv Detail & Related papers (2021-07-30T15:25:13Z) - Bandit Linear Optimization for Sequential Decision Making and
Extensive-Form Games [102.23975166536326]
Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment.
It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history.
In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time losses and (ii) $O(\sqrt{T})$ cumulative regret in expectation.
arXiv Detail & Related papers (2021-03-08T05:00:13Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)