Competitive Policy Optimization
- URL: http://arxiv.org/abs/2006.10611v1
- Date: Thu, 18 Jun 2020 15:31:09 GMT
- Title: Competitive Policy Optimization
- Authors: Manish Prajapat, Kamyar Azizzadenesheli, Alexander Liniger, Yisong
Yue, Anima Anandkumar
- Abstract summary: We propose a novel policy gradient approach that exploits the game-theoretic nature of competitive games to derive policy updates.
Motivated by the competitive gradient optimization method, we derive a bilinear approximation of the game objective.
We empirically investigate their behavior on a set of comprehensive, yet challenging, competitive games.
- Score: 137.17299766844596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A core challenge in policy optimization in competitive Markov decision
processes is the design of efficient optimization methods with desirable
convergence and stability properties. To tackle this, we propose competitive
policy optimization (CoPO), a novel policy gradient approach that exploits the
game-theoretic nature of competitive games to derive policy updates. Motivated
by the competitive gradient optimization method, we derive a bilinear
approximation of the game objective. In contrast, off-the-shelf policy gradient
methods utilize only linear approximations, and hence do not capture
interactions among the players. We instantiate CoPO in two ways: (i) competitive
policy gradient, and (ii) trust-region competitive policy optimization. We
theoretically study these methods, and empirically investigate their behavior
on a set of comprehensive, yet challenging, competitive games. We observe that
they provide stable optimization, convergence to sophisticated strategies, and
higher scores when played against baseline policy gradient methods.
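As background for the abstract's contrast between linear and bilinear approximations, below is a minimal sketch of the competitive-gradient update that motivates CoPO, written for a two-player zero-sum game where gradients and mixed second derivatives are available in closed form. All names in the snippet are illustrative; the paper's algorithms estimate the corresponding terms from sampled trajectories rather than computing them exactly.

```python
import numpy as np

def competitive_gradient_step(x, y, grad_x, grad_y, D_xy, D_yx, eta):
    """One competitive-gradient step for a zero-sum game f(x, y):
    player x minimizes f, player y maximizes f.

    grad_x, grad_y -- gradients of f with respect to x and y
    D_xy, D_yx     -- mixed second derivatives d2f/(dx dy) and d2f/(dy dx)
    eta            -- step size

    Each player's move is the Nash equilibrium of a local game built from a
    bilinear approximation of f, so the opponent's simultaneous update is
    anticipated instead of ignored.
    """
    dx = -eta * np.linalg.solve(
        np.eye(x.size) + eta**2 * D_xy @ D_yx,
        grad_x + eta * D_xy @ grad_y,
    )
    dy = eta * np.linalg.solve(
        np.eye(y.size) + eta**2 * D_yx @ D_xy,
        grad_y - eta * D_yx @ grad_x,
    )
    return x + dx, y + dy


# Toy bilinear game f(x, y) = x^T A y, where plain simultaneous gradient
# descent/ascent cycles around the equilibrium instead of converging.
A = np.array([[2.0, 1.0], [0.0, 1.0]])
x = np.array([1.0, -1.0])
y = np.array([0.5, 2.0])
for _ in range(200):
    x, y = competitive_gradient_step(
        x, y,
        grad_x=A @ y,      # df/dx
        grad_y=A.T @ x,    # df/dy
        D_xy=A, D_yx=A.T,
        eta=0.3,
    )
print(np.linalg.norm(x), np.linalg.norm(y))  # both norms approach the (0, 0) equilibrium
```

The matrix solves are what distinguish this from simultaneous gradient descent/ascent: each player best-responds to the opponent's anticipated move under the local bilinear model, which is exactly the interaction term that a purely linear approximation discards.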
Related papers
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization [3.2996723916635275]
Policy gradient methods seek to produce monotonic improvement through bounded changes in policy outputs.
In this work, we find that the performance of PPO, when applied to continuous action spaces, may be consistently improved through a simple change in objective.
We show that (1) the clipped-objective policy gradient (COPG) objective is on average "pessimistic" compared to the PPO objective, and (2) this pessimism promotes enhanced exploration.
arXiv Detail & Related papers (2023-11-10T03:02:49Z)
- Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration, in an illustrative task.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
- Policy Gradient Algorithms Implicitly Optimize by Continuation [7.351769270728942]
We argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that policies should be history-dependent and adapted to avoid local optima rather than to maximize the return.
arXiv Detail & Related papers (2023-05-11T14:50:20Z)
- Best of Both Worlds in Online Control: Competitive Ratio and Policy Regret [61.59646565655169]
We show that several recently proposed online control algorithms achieve the best of both worlds: sublinear regret vs. the best DAC policy selected in hindsight.
We conclude that sublinear regret vs. the optimal competitive policy is attainable when the linear dynamical system is unknown.
arXiv Detail & Related papers (2022-11-21T07:29:08Z)
- On the convergence of policy gradient methods to Nash equilibria in general stochastic games [33.786186304912]
We study the long-run behavior of policy gradient methods with respect to Nash equilibrium policies.
We show that policy gradient trajectories with gradient estimates provided by the REINFORCE algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ distance-squared convergence rate.
arXiv Detail & Related papers (2022-10-17T08:51:59Z)
- Memory-Constrained Policy Optimization [59.63021433336966]
We introduce a new constrained optimization method for policy gradient reinforcement learning.
We form a second trust region through the construction of another virtual policy that represents a wide range of past policies.
We then constrain the new policy to stay close to this virtual policy, which is beneficial when the old policy performs badly.
arXiv Detail & Related papers (2022-04-20T08:50:23Z)
- Independent Policy Gradient Methods for Competitive Reinforcement Learning [62.91197073795261]
We obtain global, non-asymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents.
We show that if both players run policy gradient methods in tandem, their policies will converge to a min-max equilibrium of the game, as long as their learning rates follow a two-timescale rule.
arXiv Detail & Related papers (2021-01-11T23:20:42Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.