Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis
- URL: http://arxiv.org/abs/2509.15392v1
- Date: Thu, 18 Sep 2025 19:58:31 GMT
- Title: Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis
- Authors: Sihan Zeng, Benjamin Patrick Evans, Sujay Bhatt, Leo Ardon, Sumitra Ganesh, Alec Koppel,
- Abstract summary: We study policy optimization in Stackelberg mean field games (MFGs). We propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective.
- Score: 22.360309142419208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi-level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader's and followers' objectives, use samples inefficiently due to nested-loop algorithm structure, and lack finite-time convergence guarantees. To address these limitations, we propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi-)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees. Our key assumption is a "gradient alignment" condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader-follower independence assumption. Simulation results in a range of well-established economics environments demonstrate that AC-SMFG outperforms existing multi-agent and MFG learning baselines in policy quality and convergence speed.
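For a concrete picture of the single-loop structure (one critic step, one follower step, one leader step, and one mean-field step per Markovian sample), here is a minimal self-contained toy sketch. The environment, reward tensors, step sizes, and update rules are illustrative assumptions, not the authors' AC-SMFG implementation.

```python
# Toy single-loop actor-critic sketch in the spirit of AC-SMFG (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
nS, nAL, nAF = 5, 3, 3                       # states, leader actions, follower actions

# Random toy MFG: transition kernel and reward tensors for leader and follower.
P = rng.dirichlet(np.ones(nS), size=(nS, nAL, nAF))   # P[s, aL, aF] = next-state distribution
R_L = rng.normal(size=(nS, nAL, nAF, nAF))            # leader reward; last axis: a mean-field member's action
R_F = rng.normal(size=(nS, nAL, nAF, nAF))            # follower reward

theta_L = np.zeros((nS, nAL))                # leader softmax-policy logits
theta_F = np.zeros((nS, nAF))                # representative-follower logits
V_L, V_F = np.zeros(nS), np.zeros(nS)        # critics (state values)
mu = np.ones(nAF) / nAF                      # mean field over follower actions
gamma, a_lr, c_lr, m_lr = 0.95, 0.05, 0.1, 0.05

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s = 0
for t in range(20_000):                      # single loop over Markovian samples
    pL, pF = softmax(theta_L[s]), softmax(theta_F[s])
    aL, aF = rng.choice(nAL, p=pL), rng.choice(nAF, p=pF)
    aMF = rng.choice(nAF, p=mu)              # action drawn from the current mean field
    s2 = rng.choice(nS, p=P[s, aL, aF])

    # Critic updates: semi-gradient TD(0) steps for leader and follower values.
    dL = R_L[s, aL, aF, aMF] + gamma * V_L[s2] - V_L[s]
    dF = R_F[s, aL, aF, aMF] + gamma * V_F[s2] - V_F[s]
    V_L[s] += c_lr * dL
    V_F[s] += c_lr * dF

    # Actor updates: one policy-gradient step each, using TD errors as advantages.
    gL = -pL; gL[aL] += 1.0                  # gradient of log softmax at the sampled action
    gF = -pF; gF[aF] += 1.0
    theta_L[s] += a_lr * dL * gL
    theta_F[s] += a_lr * dF * gF

    # Mean-field update: drift toward the follower's current action distribution.
    mu = (1 - m_lr) * mu + m_lr * pF
    s = s2
```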
Related papers
- Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization [7.961090665261694]
We propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. By normalizing rewards across the sampled group, our method mitigates the noise arising from variance in task difficulty and enables fine-grained credit assignment.
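A hedged sketch of the group-relative reward normalization idea (standard GRPO-style advantages; not necessarily the exact Graph-GRPO formula):

```python
# Group-relative advantages: normalize rewards within a sampled group so that
# credit is assigned relative to the group mean, removing task-difficulty offsets.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: one group of rollouts for the same task.
print(group_relative_advantages(np.array([1.0, 0.5, 2.0, 0.5])))
```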
arXiv Detail & Related papers (2026-03-03T07:45:40Z) - Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation [73.40696661117408]
We propose a progressive alignment strategy for adapting CLIP to unlabeled downstream tasks. We name our approach MP2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. Experiments show that MP2A achieves state-of-the-art performance compared with the most recent CLIP-based MS-UDA approaches.
arXiv Detail & Related papers (2025-07-31T09:42:42Z) - Improving monotonic optimization in heterogeneous multi-agent reinforcement learning with optimal marginal deterministic policy gradient [18.64288030584699]
In heterogeneous multi-agent reinforcement learning (MARL), the method replaces the sequentially computed $Q^*_{\psi}(s, a_{1:i})$ with an Optimal Marginal Q function $\phi^*_{\psi}(s, a_{1:i})$ derived from Q-functions. It further introduces a Generalized Q Critic (GQC) as the critic function, employing a pessimistic uncertainty-constrained loss to optimize different Q-value estimations.
arXiv Detail & Related papers (2025-07-14T07:16:01Z) - Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning [66.4260157478436]
We study reinforcement learning in the policy learning setting. The goal is to find a policy whose performance is competitive with the best policy in a given class of interest.
arXiv Detail & Related papers (2025-07-06T14:40:05Z) - Bandits with Preference Feedback: A Stackelberg Game Perspective [41.928798759636216]
Bandits with preference feedback present a powerful tool for optimizing unknown target functions.
We propose MAXMINLCB, which emulates a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards.
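A rough illustration of a zero-sum max-min selection rule of this flavor, assuming a finite action set and a precomputed lower-confidence bound lcb[i, j] on how favorably action i compares to action j (an illustrative assumption only; the paper's acquisition rule may differ):

```python
import numpy as np

def select_pair(lcb: np.ndarray) -> tuple[int, int]:
    # First action: largest worst-case (over comparators) lower confidence bound.
    i = int(np.argmax(lcb.min(axis=1)))
    # Second action: the most adversarial comparator for that action.
    j = int(np.argmin(lcb[i]))
    return i, j

lcb = np.array([[0.0, 0.2, -0.1],
                [-0.3, 0.0, 0.3],
                [-0.2, 0.4, 0.0]])
print(select_pair(lcb))   # -> (0, 2)
```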
arXiv Detail & Related papers (2024-06-24T15:53:11Z) - Stackelberg Batch Policy Learning [3.5426153040167754]
Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration.
Worst-case optimality algorithms, which calibrate a value-function model class from logged experience, have emerged as a promising paradigm for batch RL.
We propose a novel gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient.
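In generic bilevel notation (not necessarily the paper's symbols), the distinction is the standard one between the individual gradient and the total derivative of the leader's objective, with leader variable $\theta$, follower best response $\phi^*(\theta)$, and objective $J$:

```latex
% individual (partial) gradient, follower held fixed:
\nabla_\theta J(\theta, \phi)
% total derivative, accounting for the follower's response:
\frac{\mathrm{d}}{\mathrm{d}\theta} J\!\left(\theta, \phi^*(\theta)\right)
  = \nabla_\theta J + \left(\nabla_\theta \phi^*(\theta)\right)^{\top} \nabla_\phi J
```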
arXiv Detail & Related papers (2023-09-28T06:18:34Z) - Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks [94.07688076435818]
We study reinforcement learning for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure.
Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision making problem.
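The quantal response model in (i) is typically a softmax (logit) response to the follower's utilities; a minimal maximum-likelihood fit under that standard assumption (not necessarily the paper's exact parameterization) might look like:

```python
# Fit the rationality parameter eta of a softmax quantal-response model,
# assuming the follower picks action a with probability proportional to
# exp(eta * u[a]) given utilities u. Toy data; illustrative only.
import numpy as np

def neg_log_likelihood(eta, utilities, actions):
    nll = 0.0
    for u, a in zip(utilities, actions):
        logits = eta * u
        log_z = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        nll -= logits[a] - log_z
    return nll

utilities = [np.array([1.0, 0.2, -0.5]), np.array([0.0, 0.8, 0.3])]   # follower utilities per round
actions = [0, 1]                                                       # observed follower choices
etas = np.linspace(0.01, 10.0, 1000)
best = etas[int(np.argmin([neg_log_likelihood(e, utilities, actions) for e in etas]))]
print("estimated rationality parameter eta:", best)
```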
arXiv Detail & Related papers (2023-07-26T10:24:17Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
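In generic notation (not necessarily the paper's), the maximum-likelihood reward gradient commonly takes the standard expert-minus-learner form, and a single-loop method interleaves one policy step with one reward step per iteration instead of solving the inner RL problem to completion:

```latex
% Reward r_w, expert data \mathcal{D}, learner policy \pi_w (soft-)optimal for r_w:
\nabla_w L(w) \;=\; \mathbb{E}_{(s,a)\sim \mathcal{D}}\!\left[\nabla_w r_w(s,a)\right]
            \;-\; \mathbb{E}_{(s,a)\sim \pi_w}\!\left[\nabla_w r_w(s,a)\right]
```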
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - A Communication-efficient Algorithm with Linear Convergence for Federated Minimax Learning [1.713291434132985]
We study a large-scale multi-agent minimax optimization problem that models Generative Adversarial Networks (GANs).
The overall objective is a sum of agents' private local objective functions.
We show that FedGDA-GT converges linearly with a constant stepsize to a global $\epsilon$-GDA solution.
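For context, the underlying gradient descent-ascent (GDA) step for a minimax objective $f(x, y)$, in generic notation (the paper's federated gradient-tracking machinery is not reproduced here):

```latex
x_{t+1} = x_t - \eta \,\nabla_x f(x_t, y_t), \qquad
y_{t+1} = y_t + \eta \,\nabla_y f(x_t, y_t)
```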
arXiv Detail & Related papers (2022-06-02T16:31:16Z) - Joint Optimization of Multi-Objective Reinforcement Learning with Policy Gradient Based Algorithm [50.50545326342971]
We formulate the problem of maximizing a non-linear concave function of multiple long-term objectives. A policy-gradient-based model-free algorithm is proposed for the problem. The proposed algorithm is shown to achieve convergence to within $\epsilon$ of the global optimum.
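The scalarized policy gradient follows from the chain rule (generic notation, not the paper's exact symbols); each $\nabla_\theta J_k$ can be estimated with a standard policy-gradient estimator:

```latex
\nabla_\theta \, f\big(J_1(\theta),\dots,J_K(\theta)\big)
  \;=\; \sum_{k=1}^{K} \frac{\partial f}{\partial J_k}\,\nabla_\theta J_k(\theta)
```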
arXiv Detail & Related papers (2021-05-28T22:20:54Z) - Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time [109.06623773924737]
We study the policy gradient method for the linear-quadratic mean-field control and game.
We show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation.
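In the linear-quadratic setting, policies are typically linear state feedback $u_t = -K x_t$, and policy gradient is run directly on the feedback gain; a generic discrete-iteration sketch (the paper works with continuous-time mean-field dynamics):

```latex
K_{n+1} = K_n - \eta \,\nabla_K J(K_n)
```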
arXiv Detail & Related papers (2020-08-16T06:34:11Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
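For reference, the standard TD(0) policy-evaluation update being analyzed, in generic notation: for a sampled transition $(s, r, s')$ with stepsize $\alpha$ and discount $\gamma$,

```latex
V(s) \;\leftarrow\; V(s) + \alpha\,\big(r + \gamma V(s') - V(s)\big)
```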
arXiv Detail & Related papers (2020-03-16T17:15:28Z)