Model-Free Online Learning in Unknown Sequential Decision Making
Problems and Games
- URL: http://arxiv.org/abs/2103.04539v1
- Date: Mon, 8 Mar 2021 04:03:24 GMT
- Title: Model-Free Online Learning in Unknown Sequential Decision Making
Problems and Games
- Authors: Gabriele Farina, Tuomas Sandholm
- Abstract summary: In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium.
We formalize an online learning setting in which the strategy space is not known to the agent.
We give an efficient algorithm that achieves $O(T^{3/4})$ regret with high probability for that setting, even when the agent faces an adversarial environment.
- Score: 114.90723492840499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Regret minimization has proved to be a versatile tool for tree-form
sequential decision making and extensive-form games. In large two-player
zero-sum imperfect-information games, modern extensions of counterfactual
regret minimization (CFR) are currently the practical state of the art for
computing a Nash equilibrium. Most regret-minimization algorithms for tree-form
sequential decision making, including CFR, require (i) an exact model of the
player's decision nodes, observation nodes, and how they are linked, and (ii)
full knowledge, at all times t, about the payoffs -- even in parts of the
decision space that are not encountered at time t. Recently, there has been
growing interest in relaxing some of those restrictions and making regret
minimization applicable to settings for which reinforcement learning methods
have traditionally been used -- for example, those in which only black-box
access to the environment is available. We give the first, to our knowledge,
regret-minimization algorithm that guarantees sublinear regret with high
probability even when requirement (i) -- and thus also (ii) -- is dropped. We
formalize an online learning setting in which the strategy space is not known
to the agent and gets revealed incrementally whenever the agent encounters new
decision points. We give an efficient algorithm that achieves $O(T^{3/4})$
regret with high probability for that setting, even when the agent faces an
adversarial environment. Our experiments show it significantly outperforms the
prior algorithms for the problem, which do not have such guarantees. It can be
used in any application for which regret minimization is useful: approximating
Nash equilibrium or quantal response equilibrium, approximating coarse
correlated equilibrium in multi-player games, learning a best response,
learning safe opponent exploitation, and online play against an unknown
opponent/environment.
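To make the setting concrete, below is a minimal, illustrative Python sketch of regret matching at a single decision point whose action set is revealed incrementally over time. It is not the paper's $O(T^{3/4})$ algorithm: unlike the black-box setting studied in the paper, the sketch assumes the full utility vector over revealed actions is observed each round, and the names (RegretMatcher, reveal, act, observe) are hypothetical.

```python
# Illustrative sketch only: regret matching over a growing action set at one
# decision point. Assumes full payoff feedback for the revealed actions; this
# is NOT the paper's O(T^{3/4}) algorithm for unknown tree-form problems.
import random
from collections import defaultdict


class RegretMatcher:
    """Regret matching over an action set that is revealed incrementally."""

    def __init__(self):
        self.cum_regret = defaultdict(float)  # cumulative regret per action
        self.actions = []                     # actions revealed so far

    def reveal(self, action):
        """Register an action the environment has just revealed."""
        if action not in self.cum_regret:
            self.cum_regret[action] = 0.0
            self.actions.append(action)

    def strategy(self):
        """Mixed strategy: positive regrets normalized; uniform if none."""
        pos = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        total = sum(pos.values())
        if total <= 0.0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: p / total for a, p in pos.items()}

    def act(self):
        """Sample an action from the current strategy."""
        r, acc = random.random(), 0.0
        for a, p in self.strategy().items():
            acc += p
            if r <= acc:
                return a
        return self.actions[-1]

    def observe(self, utilities):
        """Update cumulative regrets from a utility vector over revealed actions."""
        strat = self.strategy()
        expected = sum(strat[a] * utilities[a] for a in self.actions)
        for a in self.actions:
            self.cum_regret[a] += utilities[a] - expected


if __name__ == "__main__":
    # Tiny usage example against a fixed (non-adversarial) payoff vector.
    rm = RegretMatcher()
    rm.reveal("left")
    payoffs = {"left": 0.2}
    for t in range(1, 1001):
        if t == 100:              # a new action becomes known mid-run
            rm.reveal("right")
            payoffs["right"] = 0.8
        rm.act()
        rm.observe(payoffs)
    print(rm.strategy())          # mass should concentrate on "right"
```

In the tree-form case, CFR-style methods run one such local regret minimizer at every decision point encountered so far, adding new ones as the strategy space is revealed.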
Related papers
- Barriers to Welfare Maximization with No-Regret Learning [68.66209476382213]
We prove lower bounds for computing a near-optimal $T$-sparse CCE.
In particular, we show that the inapproximability of maximum clique precludes attaining any non-trivial sparsity in polynomial time.
arXiv Detail & Related papers (2024-11-04T00:34:56Z) - Hardness of Independent Learning and Sparse Equilibrium Computation in
Markov Games [70.19141208203227]
We consider the problem of decentralized multi-agent reinforcement learning in Markov games.
We show that no algorithm attains no-regret in general-sum games when executed independently by all players.
We show that our lower bounds hold even for the seemingly easier setting in which all agents are controlled by a centralized algorithm.
arXiv Detail & Related papers (2023-03-22T03:28:12Z) - Oracle-Efficient Smoothed Online Learning for Piecewise Continuous Decision Making [73.48977854003697]
This work introduces a new notion of complexity, the generalized bracketing numbers, which marries constraints on the adversary to the size of the space.
We then instantiate our bounds in several problems of interest, including online prediction and planning of piecewise continuous functions.
arXiv Detail & Related papers (2023-02-10T18:45:52Z) - Decentralized model-free reinforcement learning in stochastic games with
average-reward objective [1.9852463786440127]
We show that our algorithm achieves sublinear regret of order $T^{3/4}$ with high probability and sublinear expected regret of order $T^{2/3}$.
Our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works.
arXiv Detail & Related papers (2023-01-13T15:59:53Z) - Regret Minimization and Convergence to Equilibria in General-sum Markov
Games [57.568118148036376]
We present the first algorithm for learning in general-sum Markov games that provides sublinear regret guarantees when executed by all agents.
Our algorithm is decentralized, computationally efficient, and does not require any communication between agents.
arXiv Detail & Related papers (2022-07-28T16:27:59Z) - ESCHER: Eschewing Importance Sampling in Games by Computing a History
Value Function to Estimate Regret [97.73233271730616]
Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies).
DREAM, the only current CFR-based neural method that is model-free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance sampling term inherited from Monte Carlo CFR (MCCFR).
We show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self play (NFSP) -- and the difference becomes dramatic as game size increases.
arXiv Detail & Related papers (2022-06-08T18:43:45Z) - Online Multiobjective Minimax Optimization and Applications [14.699969822572308]
We introduce a simple but general online learning framework in which, in each round, an adaptive adversary introduces a new game.
The learner's goal is to play so as to minimize the maximum coordinate of the cumulative vector-valued loss.
We give a simple algorithm that can compete with the setting in which the adversary must announce their action first.
This includes no-regret learning: we can recover optimal algorithms and bounds for minimizing external regret, internal regret, adaptive regret, multigroup regret, subsequence regret, and a notion of regret in the sleeping experts setting.
arXiv Detail & Related papers (2021-08-09T06:52:08Z) - Hindsight and Sequential Rationality of Correlated Play [18.176128899338433]
We look at algorithms that ensure strong performance in hindsight relative to what could have been achieved with modified behavior.
We develop and advocate for this hindsight framing of learning in general sequential decision-making settings.
We present examples illustrating the distinct strengths and weaknesses of each type of equilibrium in the literature.
arXiv Detail & Related papers (2020-12-10T18:30:21Z)