On Dynamic Programming Theory for Leader-Follower Stochastic Games
- URL: http://arxiv.org/abs/2512.05667v1
- Date: Fri, 05 Dec 2025 12:23:56 GMT
- Title: On Dynamic Programming Theory for Leader-Follower Stochastic Games
- Authors: Jilles Steeve Dibangoye, Thibaut Le Marre, Ocan Sankur, François Schwarzentruber
- Abstract summary: Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment. This paper introduces a dynamic programming framework that applies Bellman recursion over credible sets, state abstractions representing all rational follower best responses under partial leader commitments.
- Score: 10.079626733116612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that applies Bellman recursion over credible sets (state abstractions formally representing all rational follower best responses under partial leader commitments) to compute SSEs. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of ε-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate empirical gains in leader value and runtime scalability over state-of-the-art methods.
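As a rough illustration of the recursion the abstract describes, the sketch below runs finite-horizon backward induction on a toy two-state leader-follower game with strong (leader-favourable) tie-breaking. All model values and names are invented for the example, and the simplification is real: the paper's DP operates on credible sets so that follower best responses stay consistent with the leader's partial commitment, whereas this toy re-solves each stage independently.

```python
# Sketch: finite-horizon backward induction for a toy leader-follower game
# with strong (leader-favourable) tie-breaking. Illustrative only: the
# paper's DP instead runs over credible sets, keeping the follower's best
# response consistent with the leader's partial commitment; this toy
# skips that bookkeeping entirely.

S, A_LEAD, A_FOL, H = range(2), range(2), range(2), 3   # hypothetical sizes

def r_lead(s, a, b): return [[1.0, 0.0], [0.3, 0.7]][a][b] + 0.1 * s
def r_fol(s, a, b):  return [[0.2, 0.8], [0.6, 0.4]][a][b]
def step(s, a, b):   return (s + a + b) % 2             # toy transition

def stage(s, VL, VF):
    """One Bellman step at state s given continuation values VL, VF."""
    best = None
    for a in A_LEAD:                      # leader's per-state commitment
        # Follower best-responds; ties are broken in the leader's favour,
        # which is the 'strong' part of a strong Stackelberg equilibrium.
        b = max(A_FOL, key=lambda b: (r_fol(s, a, b) + VF[step(s, a, b)],
                                      r_lead(s, a, b) + VL[step(s, a, b)]))
        vl = r_lead(s, a, b) + VL[step(s, a, b)]
        vf = r_fol(s, a, b) + VF[step(s, a, b)]
        if best is None or vl > best[0]:
            best = (vl, vf)
    return best

VL, VF = [0.0] * len(S), [0.0] * len(S)
for _ in range(H):                        # backward induction over stages
    VL, VF = map(list, zip(*(stage(s, VL, VF) for s in S)))
print("leader stage-0 values:", VL)
```

A stagewise argmax like this can be time-inconsistent in general-sum games, which is precisely the gap the paper's lossless credible-set MDP reduction is designed to close.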
Related papers
- R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging [69.96389360650072]
We show that reasoning fidelity is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. We propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment.
arXiv Detail & Related papers (2026-02-06T15:17:11Z)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective [85.06838178922791]
Reinforcement Learning (RL) has proven highly effective for autoregressive language models. But adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. We propose a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy.
arXiv Detail & Related papers (2025-12-03T13:05:32Z)
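A minimal sketch of the sequence-as-one-action idea from the entry above, in a generic PyTorch shape; `elbo` here is a hypothetical stand-in for a diffusion LLM's variational bound, not the paper's implementation:

```python
import torch

def elbo(model, prompt, response):
    """Hypothetical stand-in: a differentiable lower bound on
    log p(response | prompt). For a dLLM this would average denoising
    losses over sampled timesteps; here it is an opaque callable."""
    return model(prompt, response)

def sequence_level_pg_loss(model, prompts, responses, rewards, baseline=0.0):
    # Each full response is treated as ONE action: a single surrogate
    # log-likelihood term per sequence, with the intractable likelihood
    # replaced by the ELBO. No per-token credit assignment is attempted.
    losses = []
    for x, y, r in zip(prompts, responses, rewards):
        losses.append(-(r - baseline) * elbo(model, x, y))
    return torch.stack(losses).mean()
```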
- Latent Chain-of-Thought for Visual Reasoning [53.541579327424046]
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). We reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks.
arXiv Detail & Related papers (2025-10-27T23:10:06Z)
- Generating Fair Consensus Statements with Social Choice on Token-Level MDPs [7.5036512760759715]
We model the task as a multi-objective, token-level Markov Decision Process (MDP). Token-level rewards for each agent are derived from their policy (e.g., a personalized language model). This approach utilizes the finding that such policies implicitly define optimal Q-functions, providing a principled way to quantify rewards at each generation step without a value function.
arXiv Detail & Related papers (2025-10-15T21:23:18Z)
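The finding cited in the entry above rests on a standard identity for entropy-regularized policies: if pi(a|s) is proportional to exp(Q(s,a)/beta), then beta * log pi(a|s) recovers Q(s,a) up to a state-dependent baseline, so log-probs of generated tokens can serve as stepwise rewards. A hedged sketch, with illustrative names and shapes:

```python
import torch

BETA = 1.0  # entropy-regularization temperature (assumed)

def token_rewards(agent_logits: torch.Tensor, token_ids: torch.Tensor):
    """Per-token rewards for one agent, read off its own language model.
    Since any shift constant in the action cancels when comparing
    actions, BETA * log pi(a|s) works as a token-level reward signal.
    agent_logits: (T, vocab) logits; token_ids: (T,) generated tokens."""
    logp = torch.log_softmax(agent_logits, dim=-1)                 # (T, vocab)
    return BETA * logp.gather(-1, token_ids[:, None]).squeeze(-1)  # (T,)
```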
- Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis [22.360309142419208]
We study policy optimization in Stackelberg mean field games (MFGs). We propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective.
arXiv Detail & Related papers (2025-09-18T19:58:31Z)
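For a sense of what "single-loop" means in the entry above: one critic step and one actor step per Markovian transition, with no inner policy-evaluation loop. The skeleton below is a generic sketch with hypothetical `env`, `policy`, and `value` interfaces, and it omits the mean-field and leader-follower structure specific to AC-SMFG:

```python
# Generic single-loop actor-critic skeleton, not AC-SMFG itself.
def single_loop_actor_critic(env, policy, value, gamma=0.99,
                             alpha_critic=0.05, alpha_actor=0.005,
                             iters=10_000):
    s = env.reset()
    for _ in range(iters):
        a = policy.sample(s)
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * value(s_next)
        td_error = target - value(s)                 # one-sample TD error
        value.update(s, alpha_critic * td_error)     # critic: single TD step
        policy.update(s, a, alpha_actor * td_error)  # actor: single PG step
        s = env.reset() if done else s_next
    return policy, value
```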
- Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning [55.33984461046492]
Policy-based methods currently dominate reinforcement learning pipelines for large language model (LLM) reasoning. We introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts the value-based approach to LLMs. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis.
arXiv Detail & Related papers (2025-05-21T09:41:53Z)
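The general shape of a trajectory-level Bellman residual, under a soft-Q reading of the LLM where V(s) = beta * logsumexp(Q(s, .) / beta); TBRM's precise objective and analysis are in the paper, and everything below is an assumed illustration:

```python
import torch

BETA = 1.0  # KL-regularization temperature (assumed)

def trajectory_bellman_residual(q_values, token_ids, rewards):
    """q_values: list of (vocab,) tensors Q(s_t, .); token_ids, rewards: (T,).
    Accumulates the soft Bellman residual along one trajectory and squares
    it, using the soft value V(s) = BETA * logsumexp(Q(s, .) / BETA)."""
    residuals = []
    for t, q in enumerate(q_values):
        v_next = (BETA * torch.logsumexp(q_values[t + 1] / BETA, dim=-1)
                  if t + 1 < len(q_values) else torch.tensor(0.0))
        residuals.append(q[token_ids[t]] - (rewards[t] + v_next))
    return torch.stack(residuals).sum() ** 2  # one loss term per trajectory
```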
- Contextual Bilevel Reinforcement Learning for Incentive Alignment [42.22085862132403]
We introduce Contextual Bilevel Reinforcement Learning (CB-RL), a bilevel decision-making model. CB-RL can be viewed as a Stackelberg game where the leader and a random context beyond the leader's control together decide the setup of many MDPs. This framework extends beyond traditional bilevel optimization and finds relevance in diverse fields such as reward shaping, contract theory and mechanism design.
arXiv Detail & Related papers (2024-06-03T17:54:39Z)
- Policy Iteration for Pareto-Optimal Policies in Stochastic Stackelberg Games [0.0]
In general-sum stochastic games, a stationary Stackelberg equilibrium (SSE) does not always exist. Existing methods for computing SSEs require strong assumptions to guarantee convergence and to ensure that the limit coincides with an SSE.
arXiv Detail & Related papers (2024-05-07T07:40:42Z)
- Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks [94.07688076435818]
We study reinforcement learning of a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure. Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision-making problem.
arXiv Detail & Related papers (2023-07-26T10:24:17Z)
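A quantal response, as in the entry above, replaces the follower's exact best response with a softmax over follower utilities; the rationality parameter lambda is what maximum likelihood estimation would fit. A small self-contained sketch, with all names illustrative:

```python
import numpy as np

def quantal_response(u_follower_row, lam):
    """Follower picks action b with probability proportional to
    exp(lam * u_f(a, b)), given the leader's action a."""
    z = lam * np.asarray(u_follower_row)
    p = np.exp(z - z.max())                  # numerically stable softmax
    return p / p.sum()

def neg_log_likelihood(lam, observed_pairs, u_follower):
    """observed_pairs: list of (leader_action, follower_action) samples.
    Minimize over lam, e.g. with scipy.optimize.minimize_scalar."""
    nll = 0.0
    for a, b in observed_pairs:
        nll -= np.log(quantal_response(u_follower[a], lam)[b])
    return nll
```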
- Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables [20.078557260741988]
The Group Distributionally Robust Markov Decision Process (GDR-MDP) is a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups. We then develop deep RL algorithms for GDR-MDP, covering both value-based and policy-based RL methods.
arXiv Detail & Related papers (2022-10-21T21:34:59Z)
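A toy reading of the GDR-MDP objective above: score a policy by its return under the worst admissible belief over task groups. Over the full probability simplex the minimum is attained at a vertex (the worst single group); a restricted "qualified" belief set, as in the paper, would instead require a small optimization over that set. Illustrative sketch:

```python
import numpy as np

def worst_case_value(group_values, belief_set=None):
    """group_values[g]: expected return of the policy on task group g.
    belief_set: optional finite set of candidate beliefs (assumed here);
    None means the full simplex, whose minimum sits at a vertex."""
    v = np.asarray(group_values, dtype=float)
    if belief_set is None:
        return v.min()                            # worst single group
    return min(np.dot(b, v) for b in belief_set)  # worst qualified belief
```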
- Modular Deep Reinforcement Learning for Continuous Motion Planning with Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs). The novelty is to design an embedded product MDP (EP-MDP) between a limit-deterministic generalized Büchi automaton (LDGBA) and the MDP. The proposed LDGBA-based reward shaping and discounting schemes for model-free reinforcement learning (RL) depend only on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
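The core of any product construction like the EP-MDP above: pair MDP states with automaton states and advance the automaton on the label of the state just reached. The sketch below shows only that generic shape, with hypothetical `mdp_step`, `automaton_delta`, and `label` callables; the LDGBA-specific accepting-set tracking that makes the EP-MDP useful for reward shaping is left out.

```python
# Minimal product construction between an MDP and an omega-automaton.
def product_step(mdp_step, automaton_delta, label):
    """Return a transition function over (mdp_state, aut_state) pairs."""
    def step(state, action):
        s, q = state
        s2 = mdp_step(s, action)            # MDP moves first
        q2 = automaton_delta(q, label(s2))  # automaton reads the new label
        return (s2, q2)
    return step
```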
This list is automatically generated from the titles and abstracts of the papers on this site.