Local and adaptive mirror descents in extensive-form games
- URL: http://arxiv.org/abs/2309.00656v1
- Date: Fri, 1 Sep 2023 09:20:49 GMT
- Title: Local and adaptive mirror descents in extensive-form games
- Authors: Côme Fiegel, Pierre Ménard, Tadashi Kozuno, Rémi Munos, Vianney Perchet, Michal Valko
- Abstract summary: We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect information games (IIG) with trajectory feedback.
We consider a fixed sampling approach, where players still update their policies over time, but with observations obtained through a given fixed sampling policy.
We show that this approach guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability and has a near-optimal dependence on the game parameters.
- Score: 37.04094644847904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how to learn $\epsilon$-optimal strategies in zero-sum imperfect
information games (IIG) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by $T$. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of $\tilde{\mathcal{O}}(T^{-1/2})$ with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments.
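  For intuition, here is a minimal sketch of the local update described in the abstract: Online Mirror Descent with an entropic mirror map applied independently at a single information set, with an individually decreasing learning rate and an importance-weighted loss built from a fixed sampling policy. This is not the authors' code; all names are illustrative, the 1/sqrt learning-rate schedule is an assumption, and the regularized-loss term and the theoretically tuned schedules from the paper are omitted.

```python
import numpy as np

def local_omd_update(policy, loss_estimate, visit_count, base_lr=1.0):
    """One local OMD step at a single information set (entropic mirror map).

    `policy` is the current action distribution at this information set,
    `loss_estimate` an importance-weighted loss vector built from trajectories
    drawn with a *fixed* sampling policy, and `visit_count` the number of
    updates already performed here, used to decrease the learning rate
    individually per information set (sketch only).
    """
    # Individually decreasing learning rate, one schedule per information set.
    lr = base_lr / np.sqrt(max(visit_count, 1))

    # Multiplicative-weights / exponentiated-gradient step, i.e. OMD with the
    # negative-entropy regularizer applied locally at this information set.
    logits = np.log(policy + 1e-12) - lr * loss_estimate
    new_policy = np.exp(logits - logits.max())
    return new_policy / new_policy.sum()

def importance_weighted_loss(action, realized_loss, sampling_prob, n_actions):
    """Unbiased loss estimate under the fixed sampling policy: only the played
    action receives a (re-weighted) loss, all other actions receive zero."""
    loss = np.zeros(n_actions)
    loss[action] = realized_loss / max(sampling_prob, 1e-12)
    return loss

# Toy usage at a 3-action information set visited for the 10th time.
pi = np.array([0.5, 0.3, 0.2])
loss_hat = importance_weighted_loss(action=1, realized_loss=0.7, sampling_prob=1/3, n_actions=3)
print(local_omd_update(pi, loss_hat, visit_count=10))
```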
Related papers
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach (arXiv 2024-10-17)
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
We propose as a quality index a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
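  As a rough illustration of the kind of estimator being analyzed (a hedged sketch, not RIDO itself): a Monte Carlo return estimate built by averaging truncated discounted returns over trajectories of different lengths. Function names and the example data are assumptions.

```python
import numpy as np

def truncated_return_estimate(trajectories, gamma=0.99):
    """Average discounted return over trajectories that may have different
    (truncated) lengths. `trajectories` is a list of reward sequences; the
    interface is illustrative, not taken from the RIDO paper."""
    returns = []
    for rewards in trajectories:
        discounts = gamma ** np.arange(len(rewards))
        returns.append(np.dot(discounts, rewards))
    return float(np.mean(returns))

# Example: three trajectories of lengths 5, 10 and 3 interaction steps.
rng = np.random.default_rng(0)
trajs = [rng.normal(size=n) for n in (5, 10, 3)]
print(truncated_return_estimate(trajs))
```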
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting (arXiv 2024-10-02)
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
- Policy Gradient with Active Importance Sampling (arXiv 2024-05-09)
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
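  A hedged sketch of the underlying mechanism: importance-sampling re-weighting of policy-gradient samples collected under a behavioural policy. The one-dimensional Gaussian policy, names, and toy data below are assumptions, not the paper's implementation; the point is that the variance of exactly this kind of estimator is what choosing the behavioural policy actively tries to reduce.

```python
import numpy as np

def is_policy_gradient(theta_target, theta_behavior, actions, returns, sigma=1.0):
    """Importance-sampling policy-gradient estimate for a 1-D Gaussian policy
    N(theta, sigma^2). `actions` were drawn under theta_behavior; the
    likelihood ratio re-weights them so the estimate targets theta_target."""
    logp_t = -0.5 * ((actions - theta_target) / sigma) ** 2
    logp_b = -0.5 * ((actions - theta_behavior) / sigma) ** 2
    ratios = np.exp(logp_t - logp_b)                  # importance weights
    scores = (actions - theta_target) / sigma ** 2    # d/dtheta log pi(a)
    grads = ratios * scores * returns                 # per-sample estimates
    return grads.mean(), grads.var()                  # estimate and its variance

# Toy usage: samples collected under a behavioural mean of 0.5, target mean 0.0.
rng = np.random.default_rng(0)
acts = rng.normal(0.5, 1.0, size=1000)
rets = -(acts - 1.0) ** 2                             # an arbitrary return signal
print(is_policy_gradient(0.0, 0.5, acts, rets))
```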
- Adaptively Perturbed Mirror Descent for Learning in Games (arXiv 2023-05-26)
We propose a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone.
We show that our algorithm exhibits significantly accelerated convergence.
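  A minimal sketch of a payoff-perturbed mirror-descent step on the simplex with an entropic mirror map, where the payoff gradient is pulled toward an anchor strategy. The perturbation form and strength here are hedged simplifications, not the paper's exact construction.

```python
import numpy as np

def perturbed_md_step(strategy, payoff_grad, anchor, lr=0.1, mu=0.05):
    """One mirror-descent (multiplicative-weights) ascent step on a payoff
    gradient perturbed toward an anchor strategy. The term mu * (anchor -
    strategy) pulls iterates toward the anchor, the kind of mechanism used to
    stabilize convergence in monotone games (illustrative sketch)."""
    perturbed = payoff_grad + mu * (anchor - strategy)
    logits = np.log(strategy + 1e-12) + lr * perturbed   # ascent on payoffs
    new_strategy = np.exp(logits - logits.max())
    return new_strategy / new_strategy.sum()

# Toy usage on a 3-action simplex with the uniform strategy as anchor.
x = np.array([0.2, 0.5, 0.3])
print(perturbed_md_step(x, payoff_grad=np.array([1.0, 0.0, -1.0]), anchor=np.full(3, 1/3)))
```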
- Uncertainty Voting Ensemble for Imbalanced Deep Regression (arXiv 2023-05-24)
In this paper, we introduce UVOTE, a method for learning from imbalanced data.
We replace traditional regression losses with a negative log-likelihood loss, under which the model also predicts sample-wise aleatoric uncertainty.
We show that UVOTE consistently outperforms the prior art, while at the same time producing better-calibrated uncertainty estimates.
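  To make the loss choice concrete, here is a hedged sketch of a Gaussian negative log-likelihood regression loss in which the model outputs both a mean and a per-sample log-variance (aleatoric uncertainty). The two-output formulation and names are assumptions, not UVOTE's architecture.

```python
import numpy as np

def gaussian_nll(y_true, mean_pred, log_var_pred):
    """Per-sample negative log-likelihood of y_true under N(mean, exp(log_var)),
    dropping the additive constant. Predicting the log-variance keeps the
    variance positive and lets the model down-weight samples it is uncertain
    about, which is where the sample-wise aleatoric uncertainty comes from."""
    var = np.exp(log_var_pred)
    return 0.5 * (log_var_pred + (y_true - mean_pred) ** 2 / var)

# A confident wrong prediction is penalized more than an uncertain one
# with the same error.
print(gaussian_nll(1.0, 0.0, np.log(0.1)), gaussian_nll(1.0, 0.0, np.log(1.0)))
```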
- Regularization of the policy updates for stabilizing Mean Field Games (arXiv 2023-04-04)
This work studies non-cooperative Multi-Agent Reinforcement Learning (MARL), where multiple agents interact in the same environment and each aims to maximize its individual return.
We name our algorithm Mean Field Proximal Policy Optimization (MF-PPO), and we empirically show the effectiveness of our method in the OpenSpiel framework.
- Block Policy Mirror Descent (arXiv 2022-01-15)
We present a new class of policy gradient (PG) methods, namely block policy mirror descent (BPMD).
BPMD is used to solve a class of regularized reinforcement learning (RL) problems with strongly convex regularizers.
This is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning.
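  A hedged sketch of the block-coordinate idea: instead of updating the policy at every state, a mirror-descent (multiplicative-weights) step is applied only to the block of actions at one chosen state per iteration. The entropic regularizer, step sizes, and state selection below are assumptions, not the BPMD paper's exact scheme.

```python
import numpy as np

def bpmd_step(policy, q_values, state, lr=0.1, tau=0.01):
    """Update the policy only at `state` (one block of coordinates), leaving
    all other states untouched. `q_values[state]` are action values and `tau`
    weights a strongly convex (entropy) regularizer (illustrative sketch)."""
    new_policy = policy.copy()
    block_pi = policy[state] + 1e-12
    reg_loss = -q_values[state] + tau * np.log(block_pi)   # regularized "loss"
    logits = np.log(block_pi) - lr * reg_loss              # local OMD step
    block = np.exp(logits - logits.max())
    new_policy[state] = block / block.sum()
    return new_policy

# Toy usage on 3 states x 2 actions, updating only the block at state 1.
pi = np.full((3, 2), 0.5)
q = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]])
print(bpmd_step(pi, q, state=1))
```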
- Local policy search with Bayesian optimization (arXiv 2021-06-22)
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
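  A hedged sketch of using a probabilistic surrogate for local policy search: fit a Gaussian process to previously evaluated (parameters, return) pairs and ascend a finite-difference gradient of its posterior mean. The finite-difference step and names are assumptions; the paper works with the probabilistic model of the gradient directly rather than this simplification.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def gp_gradient_step(params_hist, returns_hist, current, step=0.1, eps=1e-3):
    """Fit a GP surrogate to evaluated policy parameters and their returns,
    then take one ascent step along a finite-difference gradient of the
    posterior mean (illustrative stand-in, not the paper's algorithm)."""
    gp = GaussianProcessRegressor().fit(np.asarray(params_hist), np.asarray(returns_hist))
    grad = np.zeros_like(current)
    for i in range(len(current)):
        bumped = current.copy()
        bumped[i] += eps
        grad[i] = (gp.predict([bumped])[0] - gp.predict([current])[0]) / eps
    return current + step * grad

# Toy usage on a 2-D parameter space with a quadratic "return" surface.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = -np.sum(X ** 2, axis=1)
print(gp_gradient_step(X, y, current=np.array([0.5, -0.5])))
```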
- Semi-Supervised Learning with Variational Bayesian Inference and Maximum Uncertainty Regularization (arXiv 2020-12-03)
We propose two generic methods for improving semi-supervised learning (SSL).
The first integrates weight perturbation (WP) into existing "consistency regularization" (CR) based methods.
The second proposes a novel consistency loss called "maximum uncertainty regularization" (MUR).
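  A hedged, framework-agnostic sketch of the consistency term that weight perturbation plugs into: predictions of the clean model and a weight-perturbed copy on the same unlabeled inputs are encouraged to agree. The MSE consistency loss, perturbation scale, and `predict_fn` interface are assumptions; the MUR loss from the paper is not reproduced here.

```python
import numpy as np

def consistency_loss(predict_fn, weights, x_unlabeled, sigma=0.01, seed=0):
    """Mean squared disagreement between the clean model and a weight-perturbed
    copy on unlabeled data. `predict_fn(weights, x)` is a hypothetical pure
    prediction function; real CR methods do this inside a deep-learning
    framework (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    perturbed = [w + sigma * rng.normal(size=w.shape) for w in weights]
    p_clean = predict_fn(weights, x_unlabeled)
    p_pert = predict_fn(perturbed, x_unlabeled)
    return float(np.mean((p_clean - p_pert) ** 2))

# Toy usage with a linear "model": weights = [W, b].
def linear_predict(ws, x):
    W, b = ws
    return x @ W + b

x = np.random.default_rng(1).normal(size=(8, 3))
print(consistency_loss(linear_predict, [np.ones((3, 2)), np.zeros(2)], x))
```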