Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum
Markov Games: Switching System Approach
- URL: http://arxiv.org/abs/2306.05700v2
- Date: Mon, 12 Jun 2023 08:30:10 GMT
- Title: Finite-Time Analysis of Minimax Q-Learning for Two-Player Zero-Sum
Markov Games: Switching System Approach
- Authors: Donghwan Lee
- Abstract summary: This paper investigates the finite-time analysis of a Q-learning algorithm applied to two-player zero-sum Markov games.
We employ the switching system model of minimax Q-learning and the associated value iteration.
- Score: 3.5823366350053325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this paper is to investigate the finite-time analysis of a
Q-learning algorithm applied to two-player zero-sum Markov games. Specifically,
we establish a finite-time analysis of both the minimax Q-learning algorithm
and the corresponding value iteration method. To enhance the analysis of both
value iteration and Q-learning, we employ the switching system model of minimax
Q-learning and the associated value iteration. This approach provides further
insights into minimax Q-learning and facilitates a more straightforward and
insightful convergence analysis. We anticipate that the introduction of these
additional insights has the potential to uncover novel connections and foster
collaboration between concepts in the fields of control theory and
reinforcement learning communities.
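As a rough illustration of the algorithm analyzed in the paper: tabular minimax Q-learning replaces the usual max over actions with the value of the stage game at the next state. The sketch below is a hypothetical minimal implementation; for simplicity it approximates the stage-game value by a max-min over pure strategies, whereas the exact algorithm solves a linear program over mixed strategies.

```python
import numpy as np

def minimax_q_update(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular minimax Q-learning step for a two-player zero-sum
    Markov game. Q has shape (n_states, n_max_actions, n_min_actions).

    The stage-game value at s_next is approximated by a max-min over
    pure strategies; the exact method solves an LP over mixed strategies.
    """
    # maximizer's guaranteed value of the next-state stage game
    stage_value = np.max(np.min(Q[s_next], axis=1))
    td_target = r + gamma * stage_value
    Q[s, a, b] += alpha * (td_target - Q[s, a, b])
    return Q
```

The structure mirrors ordinary Q-learning; only the bootstrapping term changes, which is what lets the switching system machinery developed for Q-learning carry over to the minimax setting.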
Related papers
- Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning [50.92957910121088]
This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS).
For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium.
We extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.
arXiv Detail & Related papers (2024-04-30T06:48:56Z) - Unified ODE Analysis of Smooth Q-Learning Algorithms [5.152147416671501]
Recently, a convergence analysis for Q-learning was introduced using a switching system framework.
We present a more general and unified convergence analysis that improves upon the switching system approach.
arXiv Detail & Related papers (2024-04-20T01:16:27Z) - Finite-Time Error Analysis of Soft Q-Learning: Switching System Approach [4.36117236405564]
Soft Q-learning is a variation of Q-learning designed to solve entropy regularized Markov decision problems.
This paper aims to offer a novel and unified finite-time, control-theoretic analysis of soft Q-learning algorithms.
arXiv Detail & Related papers (2024-03-11T01:36:37Z) - Finite-Time Analysis of Asynchronous Q-learning under Diminishing
Step-Size from Control-Theoretic View [3.5823366350053325]
This paper investigates new finite-time analysis of asynchronous Q-learning under Markovian observation models.
In particular, we introduce a discrete-time time-varying switching system model of Q-learning with diminishing step-sizes.
The proposed analysis brings in additional insights, covers different scenarios, and provides new simplified templates for analysis.
arXiv Detail & Related papers (2022-07-25T14:15:55Z) - A unified stochastic approximation framework for learning in games [82.74514886461257]
We develop a flexible stochastic approximation framework for analyzing the long-run behavior of learning in games (both continuous and finite).
The proposed analysis template incorporates a wide array of popular learning algorithms, including gradient-based methods, exponential/multiplicative weights for learning in finite games, optimistic and bandit variants of the above, etc.
arXiv Detail & Related papers (2022-06-08T14:30:38Z) - Online Target Q-learning with Reverse Experience Replay: Efficiently
finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z) - Finite-Time Analysis of Asynchronous Q-Learning with Discrete-Time
Switching System Models [6.85316573653194]
We prove that Q-learning with a constant step-size can be naturally formulated as a discrete-time switched linear system.
This formulation offers novel and intuitive insights into Q-learning based mainly on control-theoretic frameworks.
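Concretely, the switching system viewpoint expresses the expected constant step-size Q-learning iterate in an affine switched linear form. The following is a sketch of that formulation; symbols follow the conventions of the cited papers and are stated here as assumptions:

```latex
% Expected Q-learning iterate as an affine switched linear system (sketch):
Q_{k+1} = Q_k + \alpha \left( \gamma D P \Pi_{Q_k} Q_k - D Q_k + D R \right)
        = A_{\sigma(Q_k)} Q_k + b,
\qquad
A_{\sigma(Q_k)} := I - \alpha D + \alpha \gamma D P \Pi_{Q_k},
\quad
b := \alpha D R
```

Here $P$ is the state-action transition matrix, $R$ the reward vector, $D$ a diagonal matrix of state-action visitation probabilities, and $\Pi_Q$ the greedy-policy selection matrix induced by $Q$; the system switches modes whenever the greedy policy changes.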
arXiv Detail & Related papers (2021-02-17T05:32:07Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
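As a sketch of the algorithm analyzed there: double Q-learning keeps two tabular estimates and uses one to select the greedy next action while the other evaluates it, decoupling selection from evaluation. The helper below is hypothetical; the estimate to update is passed in explicitly, whereas the actual variants choose it by a coin flip.

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, update_a, alpha=0.1, gamma=0.9):
    """One tabular double Q-learning step (illustrative sketch).

    One estimate selects the greedy next action, the other evaluates it,
    mitigating the overestimation bias of the single-estimate max.
    """
    if update_a:
        a_star = np.argmax(QA[s_next])  # select with QA
        QA[s, a] += alpha * (r + gamma * QB[s_next, a_star] - QA[s, a])  # evaluate with QB
    else:
        b_star = np.argmax(QB[s_next])  # select with QB
        QB[s, a] += alpha * (r + gamma * QA[s_next, b_star] - QB[s, a])  # evaluate with QA
```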
arXiv Detail & Related papers (2020-09-29T18:48:21Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z) - Analysis of Q-learning with Adaptation and Momentum Restart for Gradient
Descent [47.3692506462581]
We first characterize the convergence rate for Q-AMSGrad, which is the Q-learning algorithm with AMSGrad update.
To further improve the performance, we propose to incorporate the momentum restart scheme to Q-AMSGrad, resulting in the so-called Q-AMSGradR algorithm.
Our experiments on a linear quadratic regulator problem show that the two proposed Q-learning algorithms outperform the vanilla Q-learning with SGD updates.
arXiv Detail & Related papers (2020-07-15T01:11:43Z) - Periodic Q-Learning [24.099046883918046]
We study the so-called periodic Q-learning algorithm (PQ-learning for short).
PQ-learning maintains two separate Q-value estimates - the online estimate and target estimate.
In contrast to standard Q-learning, PQ-learning enjoys a simple finite-time analysis and achieves better sample complexity for finding an $\epsilon$-optimal policy.
arXiv Detail & Related papers (2020-02-23T00:33:13Z)
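The online/target structure of PQ-learning described above can be sketched as follows. The function name and parameters are hypothetical, but the key mechanism, refreshing a frozen target estimate from the online estimate every T steps, follows the description.

```python
import numpy as np

def periodic_q_step(Q_online, Q_target, s, a, r, s_next, k,
                    T=10, alpha=0.5, gamma=0.9):
    """One step of periodic (target-based) Q-learning, as a sketch.

    The TD target is computed from a frozen target estimate, which is
    copied from the online estimate once every T steps.
    """
    td_target = r + gamma * np.max(Q_target[s_next])
    Q_online[s, a] += alpha * (td_target - Q_online[s, a])
    if (k + 1) % T == 0:
        Q_target[:] = Q_online  # periodic synchronization
    return Q_online, Q_target
```

Freezing the bootstrapping target between synchronizations is what makes the finite-time analysis simple: within each period the online update is a contraction toward a fixed target.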
This list is automatically generated from the titles and abstracts of the papers in this site.