Provably Efficient Model-free RL in Leader-Follower MDP with Linear
Function Approximation
- URL: http://arxiv.org/abs/2211.15792v1
- Date: Mon, 28 Nov 2022 21:59:58 GMT
- Title: Provably Efficient Model-free RL in Leader-Follower MDP with Linear
Function Approximation
- Authors: Arnob Ghosh
- Abstract summary: We consider a multi-agent episodic MDP setup where an agent (leader) takes action at each step of the episode followed by another agent (follower)
We propose a model-free RL algorithm and show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both the leader and the follower.
This is the first sub-linear regret guarantee for Markov games with non-myopic followers under function approximation.
- Score: 1.370633147306388
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We consider a multi-agent episodic MDP setup where an agent (leader) takes
action at each step of the episode followed by another agent (follower). The
state evolution and rewards depend on the joint action pair of the leader and
the follower. Such interactions arise in many domains such as smart grids,
mechanism design, security, and policymaking. We are interested in how to learn
policies for both players with a provable performance guarantee under a bandit
feedback setting. We focus on a setup where both the leader and the follower
are {\em non-myopic}, i.e., they both seek to maximize their rewards over the
entire episode, and we consider a linear MDP, which can model the continuous
state spaces common in many RL applications. We propose a {\em model-free} RL
algorithm and show that
$\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret bounds can be achieved for both
the leader and the follower, where $d$ is the dimension of the feature mapping,
$H$ is the length of the episode, and $T$ is the total number of steps under
the bandit feedback information setup. Thus, our result holds even when the
number of states becomes infinite. The algorithm relies on a {\em novel}
adaptation of the LSVI-UCB algorithm. Specifically, we replace the standard
greedy policy (as the best response) with a soft-max policy for both the
leader and the follower, which turns out to be key in establishing a uniform
concentration bound for the value functions. To the best of our knowledge, this
is the first sub-linear regret guarantee for Markov games with non-myopic
followers under function approximation.
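As a concrete illustration of the key algorithmic change, the sketch below computes optimistic Q-estimates in the usual LSVI-UCB fashion and then draws the action from a soft-max over them instead of a greedy argmax. This is a minimal sketch under assumed names: the feature map `phi`, the regression weights `w`, the inverse Gram matrix `Lambda_inv`, the bonus weight `beta`, and the temperature `eta` are illustrative and not taken from the paper.

```python
import numpy as np

def softmax_best_response(state, actions, phi, w, Lambda_inv, beta=1.0, eta=10.0):
    """Soft-max response over optimistic Q-estimates (LSVI-UCB-style sketch).

    phi(state, a) -> d-dimensional feature vector of the linear MDP
    w             -> least-squares weights from the LSVI regression
    Lambda_inv    -> inverse of the regularized Gram matrix of past features
    beta          -> UCB bonus multiplier, eta -> soft-max temperature
    """
    q_values = []
    for a in actions:
        f = phi(state, a)
        bonus = beta * np.sqrt(f @ Lambda_inv @ f)   # optimism bonus
        q_values.append(f @ w + bonus)               # optimistic Q-estimate
    q_values = np.array(q_values)
    logits = eta * (q_values - q_values.max())       # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    # Sample from the soft-max policy rather than taking a greedy argmax.
    chosen = np.random.choice(len(actions), p=probs)
    return actions[chosen], probs
```

Intuitively, the soft-max varies smoothly with the Q-estimates, which is what makes the uniform concentration argument over the induced value-function class tractable, whereas a hard argmax lacks that smoothness.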
Related papers
- Provably Efficient Generalized Lagrangian Policy Optimization for Safe
Multi-Agent Reinforcement Learning [105.7510838453122]
We examine online safe multi-agent reinforcement learning using constrained Markov games.
We develop an upper confidence reinforcement learning algorithm to solve this Lagrangian problem.
Our algorithm updates the minimax decision primal variables via online mirror descent and the dual variable via a projected gradient step; a minimal sketch of such a primal-dual update is given below.
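The sketch below is one way to realize such an update, with the primal variable taken to be a probability vector updated by entropy mirror descent (exponentiated gradient) and the dual multiplier updated by a projected gradient step. All names (`reward_grad`, `cost_grad`, `constraint_violation`, the step sizes, and the cap on the multiplier) are illustrative assumptions; the cited paper works with occupancy measures and confidence sets rather than this toy vector form.

```python
import numpy as np

def primal_dual_step(policy, dual, reward_grad, cost_grad, constraint_violation,
                     lr_primal=0.1, lr_dual=0.1, dual_cap=10.0):
    """One primal-dual update: mirror descent on the policy (primal variable),
    projected gradient ascent on the Lagrange multiplier (dual variable)."""
    # Entropy mirror descent = exponentiated gradient on the Lagrangian.
    lagrangian_grad = reward_grad - dual * cost_grad
    logits = np.log(policy + 1e-12) + lr_primal * lagrangian_grad
    new_policy = np.exp(logits - logits.max())
    new_policy /= new_policy.sum()

    # Projected gradient step for the dual: ascend on the estimated constraint
    # violation, then project back onto the interval [0, dual_cap].
    new_dual = np.clip(dual + lr_dual * constraint_violation, 0.0, dual_cap)
    return new_policy, new_dual
```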
arXiv Detail & Related papers (2023-05-31T22:09:24Z) - Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z) - Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs)
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using $\widetilde{O}(\cdot)$ samples.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards; a minimal sketch of the redistribution step is given below.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise rewards.
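One way to picture the redistribution step: regress each trajectory's total return on the sum of its per-step features, so the fitted linear function assigns a proxy reward to every step. The linear proxy-reward form, the ridge regularizer `reg`, and all variable names below are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def redistribute_rewards(trajectory_features, trajectory_returns, reg=1.0):
    """Least-squares reward redistribution (sketch).

    trajectory_features -> list of (H, d) arrays, phi(s_h, a_h) for each step
    trajectory_returns  -> length-N array of trajectory-wise returns
    Returns theta such that sum_h phi(s_h, a_h)^T theta approximates the
    trajectory return; phi(s_h, a_h)^T theta then serves as the per-step
    proxy reward fed into the pessimistic value-iteration stage.
    """
    # Each trajectory contributes one regression row: the sum of its features.
    X = np.stack([feats.sum(axis=0) for feats in trajectory_features])  # (N, d)
    y = np.asarray(trajectory_returns, dtype=float)                     # (N,)
    d = X.shape[1]
    theta = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return theta
```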
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Collaborative Multi-agent Stochastic Linear Bandits [28.268809091816287]
We study a collaborative multi-agent linear bandit setting, where $N$ agents forming a network communicate locally to minimize their overall regret.
All the agents observe the rewards of the played actions and use an accelerated consensus procedure to compute an estimate of the average of the rewards obtained by all the agents; a plain (non-accelerated) consensus sketch is given below.
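For intuition about the consensus step, the sketch below shows plain gossip averaging: each agent repeatedly replaces its value with a weighted average of its neighbors' values, which converges to the network-wide mean under a doubly stochastic mixing matrix. The paper's accelerated procedure converges faster than this plain iteration; the function and its parameter names are illustrative only.

```python
import numpy as np

def gossip_average(values, weight_matrix, num_rounds):
    """Distributed averaging by repeated local mixing (plain consensus).

    values        -> (N,) array, one local reward estimate per agent
    weight_matrix -> (N, N) doubly stochastic matrix respecting the network
                     topology; each agent mixes only with its neighbors
    """
    x = np.asarray(values, dtype=float)
    for _ in range(num_rounds):
        x = weight_matrix @ x   # each agent averages with its neighbors
    return x                    # approaches the network-wide mean
```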
arXiv Detail & Related papers (2022-05-12T19:46:35Z) - Nearly Minimax Algorithms for Linear Bandits with Shared Representation [86.79657561369397]
We consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k$ ($\ll d$) dimensional linear representation.
We come up with novel algorithms that achieve $\widetilde{O}\left(d\sqrt{kMT} + kM\sqrt{T}\right)$ regret bounds, which match the known minimax regret lower bound up to logarithmic factors.
arXiv Detail & Related papers (2022-03-29T15:27:13Z) - Decentralized Cooperative Reinforcement Learning with Hierarchical
Information Structure [14.919120396838208]
We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications.
In each step the "leader" chooses her action first, and then the "follower" decides his action after observing the leader's action.
For the MDP setting, we obtain $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$ regret, where $H$ is the number of steps per episode, $S$ is the number of states, and $A$ and $B$ are the numbers of actions of the leader and the follower.
arXiv Detail & Related papers (2021-11-01T09:18:07Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Near-optimal Representation Learning for Linear Bandits and Linear RL [41.33483293243257]
We first consider the setting where we play $M$ linear bandits with dimension $d$ concurrently.
These bandits share a common $k$-dimensional linear representation so that $k \ll d$ and $k \ll M$.
We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve $\tilde{O}(M\sqrt{dkT} + d\sqrt{kMT})$ regret.
arXiv Detail & Related papers (2021-02-08T11:11:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.