Competitive Policy Optimization
- URL: http://arxiv.org/abs/2006.10611v1
- Date: Thu, 18 Jun 2020 15:31:09 GMT
- Title: Competitive Policy Optimization
- Authors: Manish Prajapat, Kamyar Azizzadenesheli, Alexander Liniger, Yisong
Yue, Anima Anandkumar
- Abstract summary: We propose a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates.
Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective.
We empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games.
- Score: 137.17299766844596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A core challenge in policy optimization in competitive Markov decision
processes is the design of efficient optimization methods with desirable
convergence and stability properties. To tackle this, we propose competitive
policy optimization (CoPO), a novel policy gradient approach that exploits the
game-theoretic nature of competitive games to derive policy updates. Motivated
by the competitive gradient optimization method, we derive a bilinear
approximation of the game objective. In contrast, off-the-shelf policy gradient
methods utilize only linear approximations, and hence do not capture
interactions among the players. We instantiate CoPO in two ways: (i) competitive
policy gradient, and (ii) trust-region competitive policy optimization. We
theoretically study these methods, and empirically investigate their behavior
on a set of comprehensive, yet challenging, competitive games. We observe that
they provide stable optimization, convergence to sophisticated strategies, and
higher scores when played against baseline policy gradient methods.
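As background for the abstract's contrast between linear and bilinear approximations, below is a minimal sketch of the competitive-gradient update that motivates CoPO, written for a two-player zero-sum game where gradients and mixed second derivatives are available in closed form. All names in the snippet are illustrative; the paper's algorithms estimate the corresponding terms from sampled trajectories rather than computing them exactly.

```python
import numpy as np

def competitive_gradient_step(x, y, grad_x, grad_y, D_xy, D_yx, eta):
    """One competitive-gradient step for a zero-sum game f(x, y):
    player x minimizes f, player y maximizes f.

    grad_x, grad_y -- gradients of f with respect to x and y
    D_xy, D_yx     -- mixed second derivatives d2f/(dx dy) and d2f/(dy dx)
    eta            -- step size

    Each player's move is the Nash equilibrium of a local game built from a
    bilinear approximation of f, so the opponent's simultaneous update is
    anticipated instead of ignored.
    """
    dx = -eta * np.linalg.solve(
        np.eye(x.size) + eta**2 * D_xy @ D_yx,
        grad_x + eta * D_xy @ grad_y,
    )
    dy = eta * np.linalg.solve(
        np.eye(y.size) + eta**2 * D_yx @ D_xy,
        grad_y - eta * D_yx @ grad_x,
    )
    return x + dx, y + dy


# Toy bilinear game f(x, y) = x^T A y, where plain simultaneous gradient
# descent/ascent cycles around the equilibrium instead of converging.
A = np.array([[2.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, -1.0])
y = np.array([0.5, 2.0])
for _ in range(200):
    x, y = competitive_gradient_step(
        x, y,
        grad_x=A @ y,      # df/dx
        grad_y=A.T @ x,    # df/dy
        D_xy=A, D_yx=A.T,
        eta=0.3,
    )
print(np.linalg.norm(x), np.linalg.norm(y))  # both norms approach the (0, 0) equilibrium
```

The matrix solves are what distinguish this from simultaneous gradient descent/ascent: each player best-responds to the opponent's anticipated move under the local bilinear model, which is exactly the interaction term that a purely linear approximation discards.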
Related papers
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that policies should be history-dependent and adapted to avoid local optima rather than to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- Best of Both Worlds in Online Control: Competitive Ratio and Policy Regret [61.59646565655169]
We show that several recently proposed online control algorithms achieve the best of both worlds: sublinear regret vs. the best DAC policy selected in hindsight.
We conclude that sublinear regret vs. the optimal competitive policy is attainable when the linear dynamical system is unknown.
arXiv Detail & Related papers (2022-11-21T07:29:08Z)
- On the convergence of policy gradient methods to Nash equilibria in general stochastic games [33.786186304912]
We study the long-run behavior of policy gradient methods with respect to Nash equilibrium policies.
We show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ distance-squared convergence rate.
arXiv Detail & Related papers (2022-10-17T08:51:59Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs badly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- Independent Policy Gradient Methods for Competitive Reinforcement Learning [62.91197073795261]
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents.
We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
arXiv Detail & Related papers (2021-01-11T23:20:42Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.