Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games
- URL: http://arxiv.org/abs/2511.00811v1
- Date: Sun, 02 Nov 2025 05:45:27 GMT
- Title: Equilibrium Policy Generalization: A Reinforcement Learning Framework for Cross-Graph Zero-Shot Generalization in Pursuit-Evasion Games
- Authors: Runyu Lu, Peng Zhang, Ruochuan Shi, Yuanheng Zhu, Dongbin Zhao, Yang Liu, Dong Wang, Cesare Alippi,
- Abstract summary: Pursuit-evasion game (PEG) is an important class of real-world games from the fields of robotics and security.<n>This paper proposes an Equilibrium Policy Generalization framework to learn a generalized policy with robust cross-graph zero-shot performance.<n> Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance.
- Score: 38.70408341845241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Equilibrium learning in adversarial games is an important topic widely examined in the fields of game theory and reinforcement learning (RL). Pursuit-evasion game (PEG), as an important class of real-world games from the fields of robotics and security, requires exponential time to be accurately solved. When the underlying graph structure varies, even the state-of-the-art RL methods require recomputation or at least fine-tuning, which can be time-consuming and impair real-time applicability. This paper proposes an Equilibrium Policy Generalization (EPG) framework to effectively learn a generalized policy with robust cross-graph zero-shot performance. In the context of PEGs, our framework is generally applicable to both pursuer and evader sides in both no-exit and multi-exit scenarios. These two generalizability properties, to our knowledge, are the first to appear in this domain. The core idea of the EPG framework is to train an RL policy across different graph structures against the equilibrium policy for each single graph. To construct an equilibrium oracle for single-graph policies, we present a dynamic programming (DP) algorithm that provably generates pure-strategy Nash equilibrium with near-optimal time complexity. To guarantee scalability with respect to pursuer number, we further extend DP and RL by designing a grouping mechanism and a sequence model for joint policy decomposition, respectively. Experimental results show that, using equilibrium guidance and a distance feature proposed for cross-graph PEG training, the EPG framework guarantees desirable zero-shot performance in various unseen real-world graphs. Besides, when trained under an equilibrium heuristic proposed for the graphs with exits, our generalized pursuer policy can even match the performance of the fine-tuned policies from the state-of-the-art PEG methods.
Related papers
- Decentralized Learning Strategies for Estimation Error Minimization with Graph Neural Networks [86.99017195607077]
We address real-time sampling and estimation of autoregressive Markovian sources in wireless networks.<n>We propose a graphical reinforcement learning framework for policy optimization.<n>Theoretically, our proposed policies are transferable, allowing a policy trained on one graph to be effectively applied to structurally similar graphs.
arXiv Detail & Related papers (2026-01-19T02:18:45Z) - Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric.<n>We introduce an textbfEntropy Ratio (ERC) mechanism that imposes bidirectional constraints on the entropy ratio.<n>This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions.
arXiv Detail & Related papers (2025-12-05T10:26:32Z) - R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability [25.176860778665173]
This paper introduces the first approach to worst-case robust real-time pursuit strategies (R2PS) under partial observability.<n>We first prove that a traditional dynamic programming (DP) algorithm for solving Markov PEGs maintains optimality under the asynchronous moves by the evader.<n>We then propose a belief preservation mechanism about the evader's possible positions, extending the DP pursuit strategies to a partially observable setting.
arXiv Detail & Related papers (2025-11-21T16:34:00Z) - A Unified Framework for Zero-Shot Reinforcement Learning [0.2951541543732647]
Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents in an unsupervised manner.<n>Despite growing interest, the field lacks a common analytical lens.<n>We present the first unified framework for zero-shot RL.
arXiv Detail & Related papers (2025-10-23T13:30:26Z) - Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends [64.71326476563213]
Off-policy reinforcement learning for large language models (LLMs) is attracting growing interest.<n>We present a first-principles derivation for grouprelative REINFORCE without assuming a specific training data distribution.<n>This perspective yields two general principles for adapting REINFORCE to off-policy settings.
arXiv Detail & Related papers (2025-09-29T02:34:54Z) - Policy Optimization for Continuous-time Linear-Quadratic Graphon Mean Field Games [3.1755820123640612]
Graphon mean field games offer a principled framework for approximating such games.<n>We propose and analyze a policy optimization framework for continuous-time, finite-horizon linear-quadratic GMFGs.
arXiv Detail & Related papers (2025-06-06T09:06:06Z) - Reinforcement Learning for Game-Theoretic Resource Allocation on Graphs [9.369330148791201]
Game-theoretic resource allocation on graphs (GRAG) is a problem modeled as a multi-step Colonel Blotto Game (MCBG)<n>We formulate the MCBG as a Markov Decision Process (MDP) and apply Reinforcement Learning (RL) methods, specifically Deep Q-Network (DQN) and Proximal Policy Optimization (PPO)<n>We evaluate RL performance across a variety of graph structures and initial resource distributions, comparing against random, greedy, and learned RL policies.
arXiv Detail & Related papers (2025-05-08T21:12:34Z) - A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance [3.4354636842203026]
We propose PGDA-RL, a primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs)<n>PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem.<n>We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP.
arXiv Detail & Related papers (2025-05-07T15:18:43Z) - Grasper: A Generalist Pursuer for Pursuit-Evasion Problems [36.115954360950134]
Pursuit-evasion games (PEGs) model interactions between a team of pursuers and an evader in graph-based environments.
Recent advancements have demonstrated the effectiveness of the pre-training and fine-tuning paradigm in PSRO.
We introduce Grasper, a GeneRAlist purSuer for Pursuit-Evasion pRoblems, capable of efficiently generating pursuer policies tailored to specific PEGs.
arXiv Detail & Related papers (2024-04-19T04:54:38Z) - Faster Last-iterate Convergence of Policy Optimization in Zero-Sum
Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z) - A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
offline reinforcement learning enables learning from a fixed dataset, without further interactions with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and ablation study on the D4RL dataset validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.