A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum
Markov Games
- URL: http://arxiv.org/abs/2303.09716v4
- Date: Sat, 28 Oct 2023 20:23:45 GMT
- Title: A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum
Markov Games
- Authors: Anna Winnicki, R. Srikant
- Abstract summary: We show that a simple variant of naive policy iteration for games converges exponentially fast.
We also show that lookahead policies can be implemented efficiently in the function approximation setting of linear Markov games.
- Score: 10.805520579293747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimal policies in standard MDPs can be obtained using either value
iteration or policy iteration. However, in the case of zero-sum Markov games,
there is no efficient policy iteration algorithm; e.g., it has been shown that
one has to solve $\Omega(1/(1-\alpha))$ MDPs, where $\alpha$ is the discount factor,
to implement the only known convergent version of policy iteration. Another
algorithm, called naive policy iteration, is easy to implement but is only
provably convergent under very restrictive assumptions. Prior attempts to fix the
naive policy iteration algorithm have several limitations. Here, we show that a
simple variant of naive policy iteration for games converges exponentially
fast. The only addition we propose to naive policy iteration is the use of
lookahead policies, which are anyway used in practical algorithms. We further
show that lookahead can be implemented efficiently in the function
approximation setting of linear Markov games, which are the counterpart of the
much-studied linear MDPs. We illustrate the application of our algorithm by
providing bounds for policy-based RL (reinforcement learning) algorithms. We
extend the results to the function approximation setting.
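The central ingredient described in the abstract, replacing the one-step greedy update of naive policy iteration with an $H$-step lookahead in which each state's improvement step solves a zero-sum matrix game, can be sketched in a few lines. The code below is a hypothetical tabular illustration of that ingredient only, not the authors' exact procedure or their function-approximation scheme for linear Markov games; the reward tensor R (shape S x A x B), transition tensor P (shape S x A x B x S), lookahead depth H, and the linear-programming matrix-game solver are all illustrative assumptions.

```python
# Hypothetical tabular sketch of an H-step lookahead greedy step for a
# two-player zero-sum Markov game. Shapes and solver choices are assumptions,
# not the paper's implementation.
import numpy as np
from scipy.optimize import linprog


def matrix_game(G):
    """Value and maximizing strategy of the matrix game max_x min_y x^T G y, via an LP."""
    m, n = G.shape
    # Decision variables z = (x_1, ..., x_m, v); maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints: v <= (G^T x)_j for every column j, sum(x) = 1, x >= 0.
    A_ub = np.hstack([-G.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]


def minimax_bellman(V, R, P, gamma):
    """One application of the minimax Bellman operator: (TV)(s) = value of the stage game at s."""
    S = R.shape[0]
    TV = np.empty(S)
    for s in range(S):
        Q = R[s] + gamma * P[s] @ V  # Q[a, b] = r(s, a, b) + gamma * E[V(s') | s, a, b]
        TV[s], _ = matrix_game(Q)
    return TV


def lookahead_greedy_iteration(R, P, gamma, H=3, iters=50):
    """R: (S, A, B) rewards; P: (S, A, B, S) transitions; H: lookahead depth."""
    S, A = R.shape[0], R.shape[1]
    V = np.zeros(S)
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        # H-step lookahead: push the current value estimate through T^(H-1) ...
        W = V.copy()
        for _ in range(H - 1):
            W = minimax_bellman(W, R, P, gamma)
        # ... then take the greedy (matrix-game) policy against the looked-ahead values.
        for s in range(S):
            Q = R[s] + gamma * P[s] @ W
            V[s], pi[s] = matrix_game(Q)
    return V, pi
```

As a usage example, random rewards R of shape (5, 2, 3) and row-stochastic transitions P of shape (5, 2, 3, 5) with gamma = 0.9 run in well under a second. In the paper's linear Markov game setting the lookahead is implemented without enumerating states; the dense tensors above are purely for illustration.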
Related papers
- COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences [31.988100672680154]
We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences.
Our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization.
arXiv Detail & Related papers (2024-10-30T17:13:02Z) - Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization [37.24692425018]
We study online learning in constrained MDPs (CMDPs).
Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial MDPs.
arXiv Detail & Related papers (2024-10-03T07:54:04Z) - Bridging the Gap between Newton-Raphson Method and Regularized Policy
Iteration [13.166738075816493]
We show that regularized policy iteration is strictly equivalent to the standard Newton-Raphson method when the Bellman equation is smoothed with strongly convex functions.
We prove that regularized policy iteration has global linear convergence, with the rate being $\gamma$ (the discount factor).
We also show that a modified version of regularized policy iteration, i.e., with finite-step policy evaluation, is equivalent to the inexact Newton method in which the Newton iteration formula is solved with truncated iterations.
arXiv Detail & Related papers (2023-10-11T05:55:20Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\mathrm{poly}(d)/\varepsilon^3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Provable Benefits of Actor-Critic Methods for Offline Reinforcement
Learning [85.50033812217254]
Actor-critic methods are widely used in offline reinforcement learning practice, but are not so well-understood theoretically.
We propose a new offline actor-critic algorithm that naturally incorporates the pessimism principle.
arXiv Detail & Related papers (2021-08-19T17:27:29Z) - Improved Regret Bound and Experience Replay in Regularized Policy
Iteration [22.621710838468097]
We study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation.
We first show that the regret analysis of the Politex algorithm can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions.
Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
arXiv Detail & Related papers (2021-02-25T00:55:07Z) - Optimization Issues in KL-Constrained Approximate Policy Iteration [48.24321346619156]
Many reinforcement learning algorithms can be seen as versions of approximate policy iteration (API).
While standard API often performs poorly, it has been shown that learning can be stabilized by regularizing each policy update by the KL-divergence to the previous policy.
Popular practical algorithms such as TRPO, MPO, and VMPO replace regularization by a constraint on the KL-divergence between consecutive policies.
arXiv Detail & Related papers (2021-02-11T19:35:33Z) - Variance Penalized On-Policy and Off-Policy Actor-Critic [60.06593931848165]
We propose on-policy and off-policy actor-critic algorithms that optimize a performance criterion involving both mean and variance in the return.
Our approach not only performs on par with actor-critic and prior variance-penalization baselines in terms of expected return, but also generates trajectories which have lower variance in the return.
arXiv Detail & Related papers (2021-02-03T10:06:16Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)