Generalized Bandit Regret Minimizer Framework in Imperfect Information
Extensive-Form Game
- URL: http://arxiv.org/abs/2203.05920v4
- Date: Fri, 18 Aug 2023 14:16:12 GMT
- Title: Generalized Bandit Regret Minimizer Framework in Imperfect Information
Extensive-Form Game
- Authors: Linjian Meng, Yang Gao
- Abstract summary: We consider the problem in the interactive bandit-feedback setting where we don't know the dynamics of the IIEG.
To learn NE, the regret minimizer is required to estimate the full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret.
We present a novel method SIX-OMD to learn approximate NE. It is model-free and substantially improves the best existing convergence rate from the order of $O(\sqrt{X B/T}+\sqrt{Y C/T})$ to $O(\sqrt{M_{\mathcal{X}}/T}+\sqrt{M_{\mathcal{Y}}/T})$.
- Score: 9.933208900617174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Regret minimization methods are a powerful tool for learning approximate Nash
equilibrium (NE) in two-player zero-sum imperfect information extensive-form
games (IIEGs). We consider the problem in the interactive bandit-feedback
setting where we don't know the dynamics of the IIEG. In general, only the
interactive trajectory and the reached terminal node value $v(z^t)$ are
revealed. To learn NE, the regret minimizer is required to estimate the
full-feedback loss gradient $\ell^t$ by $v(z^t)$ and minimize the regret. In
this paper, we propose a generalized framework for this learning setting, which
supports the design and modular analysis of bandit regret minimization methods.
We demonstrate that the most recent bandit
regret minimization methods can be analyzed as a particular case of our
framework. Following this framework, we describe a novel method SIX-OMD to
learn approximate NE. It is model-free and substantially improves the best
existing convergence rate from the order of $O(\sqrt{X B/T}+\sqrt{Y C/T})$ to
$O(\sqrt{M_{\mathcal{X}}/T}+\sqrt{M_{\mathcal{Y}}/T})$. Moreover, SIX-OMD is
computationally efficient, as it only needs to update the current strategy and
the average strategy along the sampled trajectory.
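As a rough illustration of the two steps described in the abstract, the sketch below (an assumption-laden toy, not the paper's SIX-OMD) runs a bandit regret minimizer on a single decision point instead of a full extensive-form game tree: it estimates the full-feedback loss gradient $\ell^t$ from the single observed payoff $v(z^t)$ by importance weighting, then applies an entropic online-mirror-descent update. All names and constants are illustrative.

```python
# Minimal sketch (assumption: this is NOT the paper's SIX-OMD). A single
# decision point with `num_actions` actions stands in for the full IIEG.
# Each round: sample an action, observe only the sampled payoff v(z^t),
# build an importance-weighted estimate of the loss gradient ell^t, and
# run an entropic online-mirror-descent (multiplicative-weights) update.
import numpy as np

rng = np.random.default_rng(0)
num_actions, horizon, eta = 3, 10_000, 0.05       # illustrative constants
true_loss = np.array([0.2, 0.5, 0.8])             # hidden per-action losses

strategy = np.full(num_actions, 1.0 / num_actions)  # current strategy
avg_strategy = np.zeros(num_actions)                 # running average strategy

for t in range(horizon):
    # Interact: only the loss of the sampled action is revealed.
    a = rng.choice(num_actions, p=strategy)
    v = true_loss[a] + 0.1 * rng.standard_normal()

    # Bandit loss estimator: importance weighting keeps the estimate unbiased,
    # so the full-feedback regret analysis carries over in expectation.
    loss_hat = np.zeros(num_actions)
    loss_hat[a] = v / strategy[a]

    # OMD step with the negative-entropy regularizer; only the sampled
    # coordinate of loss_hat is nonzero, so the update is cheap.
    strategy = strategy * np.exp(-eta * loss_hat)
    strategy /= strategy.sum()

    # Running average strategy (in the paper's two-player zero-sum setting,
    # the average strategies are what converge toward a Nash equilibrium).
    avg_strategy += (strategy - avg_strategy) / (t + 1)

print("average strategy:", np.round(avg_strategy, 3))
```

In the actual IIEG setting, the analogous estimator and update are applied at the information sets along the sampled trajectory only, which is what the abstract means by updating the current and average strategies only along the sampled trajectory.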
Related papers
- Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z)
- Borda Regret Minimization for Generalized Linear Dueling Bandits [65.09919504862496]
We study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score.
We propose a rich class of generalized linear dueling bandit models, which cover many existing models.
Our algorithm achieves an $\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal.
arXiv Detail & Related papers (2023-03-15T17:59:27Z)
- Revisiting Weighted Strategy for Non-stationary Parametric Bandits [82.1942459195896]
This paper revisits the weighted strategy for non-stationary parametric bandits.
We propose a refined analysis framework, which produces a simpler weight-based algorithm.
Our new framework can be used to improve regret bounds of other parametric bandits.
arXiv Detail & Related papers (2023-03-05T15:11:14Z)
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near-optimal regret.
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
- Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z)
- Expected Worst Case Regret via Stochastic Sequential Covering [14.834625066344582]
We introduce a notion of expected worst case minimax regret that generalizes and encompasses prior known minimax regrets.
For such minimax regrets we establish tight bounds via a novel concept of upper global sequential covering.
We demonstrate the effectiveness of our approach by establishing tight bounds on the expected worst case minimax regrets for logarithmic loss and general mixable losses.
arXiv Detail & Related papers (2022-09-09T17:31:46Z)
- The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL).
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the full-information setting and the bandit feedback setting.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.