Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation
- URL: http://arxiv.org/abs/2602.07205v1
- Date: Fri, 06 Feb 2026 21:25:54 GMT
- Title: Online Learning for Uninformed Markov Games: Empirical Nash-Value Regret and Non-Stationarity Adaptation
- Authors: Junyan Liu, Haipeng Luo, Zihan Zhang, Lillian J. Ratliff
- Abstract summary: We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. We introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret. We show how to adaptively restart an epoch-based V-learning algorithm with an appropriate epoch factor $\eta$ in response to the potential non-stationarity of the opponent.
- Score: 54.274028560515454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study online learning in two-player uninformed Markov games, where the opponent's actions and policies are unobserved. In this setting, Tian et al. (2021) show that achieving no-external-regret is impossible without incurring an exponential dependence on the episode length $H$. They then turn to the weaker notion of Nash-value regret and propose a V-learning algorithm with regret $O(K^{2/3})$ after $K$ episodes. However, their algorithm and guarantee do not adapt to the difficulty of the problem: even in the case where the opponent follows a fixed policy and thus $O(\sqrt{K})$ external regret is well-known to be achievable, their result still gives the worse rate $O(K^{2/3})$ on a weaker metric. In this work, we fully address both limitations. First, we introduce empirical Nash-value regret, a new regret notion that is strictly stronger than Nash-value regret and naturally reduces to external regret when the opponent follows a fixed policy. Moreover, under this new metric, we propose a parameter-free algorithm that achieves an $O(\min \{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ regret bound, where $C$ quantifies the variance of the opponent's policies and $L$ denotes the number of policy switches (both at most $O(K)$). Therefore, our results not only recover the two extremes -- $O(\sqrt{K})$ external regret when the opponent is fixed and $O(K^{2/3})$ Nash-value regret in the worst case -- but also smoothly interpolate between them by automatically adapting to the opponent's non-stationarity. We achieve this by first providing a new analysis of the epoch-based V-learning algorithm of Mao et al. (2022), establishing an $O(\eta C + \sqrt{K/\eta})$ regret bound, where $\eta$ is the epoch incremental factor. Next, we show how to adaptively restart this algorithm with an appropriate $\eta$ in response to the potential non-stationarity of the opponent, eventually achieving our final results.
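As a quick sanity check, here is a minimal Python sketch (the helper name is hypothetical and all constants and logarithmic factors are dropped) of how the $O(\min \{\sqrt{K} + (CK)^{1/3}, \sqrt{LK}\})$ bound interpolates between the two extremes described in the abstract:

```python
import math

def empirical_nash_value_regret_bound(K: int, C: float, L: int) -> float:
    """Evaluate min{sqrt(K) + (C*K)^(1/3), sqrt(L*K)} with all constants
    and log factors dropped (hypothetical helper, not the paper's code)."""
    variance_branch = math.sqrt(K) + (C * K) ** (1 / 3)  # adapts to opponent variance C
    switch_branch = math.sqrt(L * K)                     # adapts to L policy switches
    return min(variance_branch, switch_branch)

K = 10**6
# Fixed opponent (C = 0, L = 1): collapses to sqrt(K), the external-regret rate.
print(empirical_nash_value_regret_bound(K, C=0, L=1))  # 1000.0 = sqrt(K)
# Fully non-stationary opponent (C = L = K): matches the K^(2/3) worst case.
print(empirical_nash_value_regret_bound(K, C=K, L=K))  # 11000.0, order K^(2/3)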
Related papers
- Convergence of Regret Matching in Potential Games and Constrained Optimization [85.55969013318627]
We show that alternating RM$^+$ converges to an $\epsilon$-KKT point after $O_\epsilon(1/\epsilon^4)$ iterations, establishing for the first time that it is a sound and fast first-order method. Our lower bound shows that convergence to coarse correlated equilibria in potential games is exponentially faster than convergence to Nash equilibria.
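For background, a minimal sketch of the standard RM$^+$ update underlying this line of work (a generic textbook version, not the paper's code; alternating RM$^+$ simply applies this update to each player in turn):

```python
import numpy as np

def rm_plus_step(Q: np.ndarray, utility: np.ndarray):
    """One Regret Matching+ (RM+) update: play proportional to the positive
    cumulative regrets, then clip the updated regrets at zero."""
    total = Q.sum()
    x = Q / total if total > 0 else np.full(len(Q), 1.0 / len(Q))  # uniform at start
    instant_regret = utility - x @ utility  # each action's regret vs. current play
    return x, np.maximum(0.0, Q + instant_regret)

# Toy usage: two actions with fixed utilities (0.3, 0.7).
Q = np.zeros(2)
for _ in range(100):
    x, Q = rm_plus_step(Q, np.array([0.3, 0.7]))
print(x)  # concentrates on the higher-utility action
```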
arXiv Detail & Related papers (2025-10-20T00:45:47Z) - Instance-Dependent Regret Bounds for Learning Two-Player Zero-Sum Games with Bandit Feedback [60.610120215789976]
We show that when a pure strategy Nash equilibrium exists, $c$ becomes zero, leading to an optimal instance-dependent regret bound. Our algorithm also enjoys last-iterate convergence and can identify the pure strategy Nash equilibrium with near-optimal sample complexity.
arXiv Detail & Related papers (2025-02-24T20:20:06Z) - Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a. full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting.
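To make the feedback model concrete, here is a small sketch (the toy environment and all names are hypothetical, not this paper's API) in which the learner observes only the episode's total loss:

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment, used only to illustrate the feedback model."""
    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        loss = random.random() * (1 + action)  # per-step loss exists internally...
        self.state = (self.state + action) % 2
        return self.state, loss

def run_episode_aggregate_feedback(policy, env: ToyEnv, horizon: int) -> float:
    """Under aggregate bandit (full-bandit) feedback, only the returned sum is
    revealed to the learner; the individual per-step losses never are."""
    state = env.reset()
    total = 0.0
    for h in range(horizon):
        action = policy(state, h)
        state, loss = env.step(action)
        total += loss  # ...but is folded into a single scalar
    return total

print(run_episode_aggregate_feedback(lambda s, h: s % 2, ToyEnv(), horizon=5))
```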
arXiv Detail & Related papers (2025-02-06T12:03:24Z) - p-Mean Regret for Stochastic Bandits [52.828710025519996]
We introduce a simple, unified UCB-based algorithm that achieves novel $p$-mean regret bounds. Our framework encompasses both average cumulative regret and Nash regret as special cases.
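Assuming the standard power-mean formulation (the paper's exact definition may differ), a short sketch of how $p$-mean regret subsumes both special cases:

```python
import math

def p_mean(values, p: float) -> float:
    """Power mean M_p; p = 1 is the arithmetic mean, p = 0 the geometric-mean limit."""
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1 / p)

def p_mean_regret(mu_star: float, played_means, p: float) -> float:
    """Hypothetical p-mean regret: gap between the best arm's mean reward and the
    p-mean of the mean rewards of the arms actually played. p = 1 recovers average
    (cumulative/T) regret; p = 0 recovers Nash regret's geometric mean."""
    return mu_star - p_mean(played_means, p)

played = [0.9, 0.5, 0.8, 0.9]
print(p_mean_regret(0.9, played, p=1))  # average regret
print(p_mean_regret(0.9, played, p=0))  # Nash-style (geometric-mean) regret
```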
arXiv Detail & Related papers (2024-12-14T08:38:26Z) - On the Limitations and Possibilities of Nash Regret Minimization in Zero-Sum Matrix Games under Noisy Feedback [21.332966440675758]
Nash regret is the difference between the player's total reward and the game's Nash equilibrium value scaled by the time horizon $T$. We show that standard algorithms incur Nash regret even when the row player receives noisy feedback on the entire matrix $A$. We present the first algorithm for general $n \times m$ matrix games that achieves instance-dependent $\text{polylog}(T)$ Nash regret.
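One standard way to write this definition for a zero-sum matrix game $A$ (an assumed formalization consistent with the sentence above, not quoted from the paper):

```latex
% x_t, y_t: the players' mixed strategies at round t; v^*: the game value.
\[
  \mathrm{NashRegret}_T = T\, v^{*} - \sum_{t=1}^{T} x_t^{\top} A\, y_t,
  \qquad
  v^{*} = \max_{x \in \Delta_n} \min_{y \in \Delta_m} x^{\top} A\, y .
\]
```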
arXiv Detail & Related papers (2023-06-22T22:45:48Z) - Achieving Better Regret against Strategic Adversaries [15.51709428653595]
We study online learning problems in which the learner has extra knowledge about the adversary's behaviour.
We propose two new online learning algorithms, Accurate Follow the Regularized Leader (AFTRL) and Prod-Best Response (Prod-BR).
AFTRL achieves $O(1)$ external regret or $O(1)$ forward regret against a no-external-regret adversary.
arXiv Detail & Related papers (2023-02-13T19:34:36Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.