A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with
an Arbitrary Opponent
- URL: http://arxiv.org/abs/2109.03396v3
- Date: Mon, 11 Mar 2024 10:06:17 GMT
- Title: A Bayesian Learning Algorithm for Unknown Zero-sum Stochastic Games with
an Arbitrary Opponent
- Authors: Mehdi Jafarnia-Jahromi, Rahul Jain, Ashutosh Nayyar
- Abstract summary: We propose Posterior Sampling Reinforcement Learning for Zero-sum Stochastic Games (PSRL-ZSG).
- Score: 9.094186120476174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose Posterior Sampling Reinforcement Learning for
Zero-sum Stochastic Games (PSRL-ZSG), the first online learning algorithm that
achieves a Bayesian regret bound of $O(HS\sqrt{AT})$ for infinite-horizon
zero-sum stochastic games with the average-reward criterion. Here $H$ is an upper
bound on the span of the bias function, $S$ is the number of states, $A$ is the
number of joint actions, and $T$ is the horizon. We consider the online setting
where the opponent cannot be controlled and may follow an arbitrary,
time-adaptive, history-dependent strategy. Our regret bound improves on the best
existing regret bound of $O(\sqrt[3]{DS^2AT^2})$ by Wei et al. (2017) under the
same assumption and matches the theoretical lower bound in $T$.
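To make the algorithmic idea concrete, here is a minimal Python sketch of the posterior-sampling loop the abstract describes: keep a Dirichlet posterior over the transition kernel, sample a game model at the start of each episode, plan a maximin policy on the sampled game, and play it against the arbitrary opponent. The fixed episode length, the finite-sweep planner, and the `reward`/`env_step` interface are illustrative assumptions, not the authors' implementation of PSRL-ZSG.

```python
# Minimal sketch of a posterior-sampling loop for a zero-sum stochastic game.
# All interfaces and schedules here are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    """Maximin value and mixed strategy of the row (maximizing) player of payoff matrix G."""
    m, n = G.shape
    # Variables z = (x_1..x_m, v): maximize v s.t. G^T x >= v, sum(x) = 1, x >= 0.
    c = np.zeros(m + 1); c[-1] = -1.0                      # linprog minimizes, so use -v
    A_ub = np.hstack([-G.T, np.ones((n, 1))])              # v - (G^T x)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1)); A_eq[0, :m] = 1.0         # strategy sums to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0.0, 1.0)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

def psrl_zsg_sketch(S, A, B, reward, env_step, T, sweeps=50, episode_len=100):
    """Sample a game model from the posterior, plan a maximin policy, play an episode."""
    counts = np.ones((S, A, B, S))                          # Dirichlet(1) prior over transitions
    s = 0
    for _ in range(0, T, episode_len):
        # Draw a transition kernel from the current posterior.
        P = np.array([[[np.random.dirichlet(counts[si, a, b]) for b in range(B)]
                       for a in range(A)] for si in range(S)])
        # Approximate maximin planning on the sampled game; reward has shape (S, A, B).
        V, pi = np.zeros(S), np.zeros((S, A))
        for _ in range(sweeps):
            Q = reward + P @ V                               # stage game at each state, (S, A, B)
            for si in range(S):
                V[si], pi[si] = matrix_game_value(Q[si])
            V -= V.min()                                     # keep the relative values bounded
        # Play the sampled maximin policy against the arbitrary opponent for one episode.
        for _ in range(episode_len):
            p = np.clip(pi[s], 0.0, None); p /= p.sum()
            a = np.random.choice(A, p=p)
            s_next, b = env_step(s, a)                       # the opponent's action b is observed
            counts[s, a, b, s_next] += 1                     # conjugate posterior update
            s = s_next
```

The per-state linear program computes the value of the stage matrix game, which is the standard maximin planning step; the finite number of value-iteration sweeps stands in for the exact average-reward planning analyzed in the paper.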
Related papers
- Best of Both Worlds: Regret Minimization versus Minimax Play [57.68976579579758]
We show that our results guarantee a loss of at most $O(1)$ while being able to gain $\Omega(T)$ against exploitable opponents.
arXiv Detail & Related papers (2025-02-17T11:04:01Z) - Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback [49.84060509296641]
We study online finite-horizon Markov Decision Processes with adversarially changing losses and aggregate bandit feedback (a.k.a. full-bandit feedback).
Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory.
We introduce the first Policy Optimization algorithms for this setting.
arXiv Detail & Related papers (2025-02-06T12:03:24Z) - Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games [1.9161424760743742]
We propose an online model selection algorithm for the average reward RL setting.
We show that the additional cost of model selection scales only linearly in $M$, the number of model classes.
We also show that the exponential dependency on $m*$ is inevitable by proving a lower bound on the learner's regret.
arXiv Detail & Related papers (2024-11-09T05:03:10Z) - Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs [72.40181882916089]
We show that our algorithm achieves an $\tilde{O}\big((d+\log(|\mathcal{S}|^2|\mathcal{A}|))\sqrt{K}\big)$ regret with full-information feedback, where $d$ is the dimension of a known feature mapping linearly parametrizing the unknown transition kernel of the MDP, $K$ is the number of episodes, and $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces.
arXiv Detail & Related papers (2023-05-15T05:37:32Z) - Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$; a small numerical illustration of this objective appears after the list.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z) - Minimax-Optimal Multi-Agent RL in Zero-Sum Markov Games With a
Generative Model [50.38446482252857]
Two-player zero-sum Markov games are arguably the most basic setting in multi-agent reinforcement learning.
We develop a learning algorithm that learns an $\varepsilon$-approximate Markov NE policy using a minimax-optimal number of samples from the generative model.
We derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities.
arXiv Detail & Related papers (2022-08-22T17:24:55Z) - Fast Rates for Nonparametric Online Learning: From Realizability to
Learning in Games [36.969021834291745]
We propose a proper learning algorithm which gets a near-optimal mistake bound in terms of the sequential fat-shattering dimension of the hypothesis class.
This answers the question of whether proper learners can achieve near-optimal mistake bounds.
For the real-valued (regression) setting, the optimal mistake bound was not even known for improper learners.
arXiv Detail & Related papers (2021-11-17T05:24:21Z) - Near-Optimal Reinforcement Learning with Self-Play [50.29853537456737]
We focus on self-play algorithms, which learn the optimal policy by playing against themselves without any direct supervision.
We propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{\mathcal{O}}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{\mathcal{O}}(S(A+B))$.
arXiv Detail & Related papers (2020-06-22T05:00:13Z)
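For the risk-sensitive entry above, a minimal numerical sketch of the CVaR objective, assuming the common lower-tail convention on returns (the function name and data are illustrative, not from the cited paper):

```python
# Empirical CVaR_tau: the mean of the worst tau-fraction of returns.
import numpy as np

def cvar(returns, tau):
    """Average of the lowest tau-fraction of returns (lower-tail convention)."""
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(tau * returns.size)))
    return returns[:k].mean()

rng = np.random.default_rng(0)
episode_returns = rng.normal(loc=1.0, scale=2.0, size=1000)
print(cvar(episode_returns, tau=0.1))   # well below the mean return of about 1.0
```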