An Analysis of Quantile Temporal-Difference Learning
- URL: http://arxiv.org/abs/2301.04462v3
- Date: Mon, 20 May 2024 10:03:58 GMT
- Title: An Analysis of Quantile Temporal-Difference Learning
- Authors: Mark Rowland, Rémi Munos, Mohammad Gheshlaghi Azar, Yunhao Tang, Georg Ostrovski, Anna Harutyunyan, Karl Tuyls, Marc G. Bellemare, Will Dabney
- Abstract summary: Quantile temporal-difference learning (QTD) has proven to be a key component in several successful large-scale applications of reinforcement learning.
Unlike classical TD learning, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points.
The paper's core result is a proof that QTD converges with probability 1 to the fixed points of a related family of dynamic programming procedures.
- Score: 53.36758478669685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We analyse quantile temporal-difference learning (QTD), a distributional reinforcement learning algorithm that has proven to be a key component in several successful large-scale applications of reinforcement learning. Despite these empirical successes, a theoretical understanding of QTD has proven elusive until now. Unlike classical TD learning, which can be analysed with standard stochastic approximation tools, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points. The core result of this paper is a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1, putting QTD on firm theoretical footing. The proof establishes connections between QTD and non-linear differential inclusions through stochastic approximation theory and non-smooth analysis.
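For concreteness, here is a minimal sketch of a tabular QTD update of the kind analysed in the paper, assuming quantile midpoints $\tau_i = (2i+1)/(2m)$ and the averaged sub-gradient form of the update; the function and variable names are illustrative, not the paper's notation:

```python
import numpy as np

def qtd_update(theta, s, r, s_next, alpha=0.01, gamma=0.99):
    # theta[s, i] estimates the tau_i-quantile of the return from state s,
    # with quantile midpoints tau_i = (2i + 1) / (2m), i = 0, ..., m - 1.
    m = theta.shape[1]
    tau = (2 * np.arange(m) + 1) / (2 * m)
    # Bootstrapped sample targets: one per quantile estimate at s'.
    targets = r + gamma * theta[s_next]                  # shape (m,)
    for i in range(m):
        # Averaged sub-gradient of the quantile (pinball) loss: move the
        # estimate up with weight tau_i and down with weight 1 - tau_i.
        frac_below = np.mean(targets < theta[s, i])
        theta[s, i] += alpha * (tau[i] - frac_below)
    return theta
```

The indicator term makes the update non-linear in the estimates, which is precisely why the standard contraction-based TD analysis does not apply.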
Related papers
- Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis [82.51626700527837]
Chain-of-thought (CoT) is an efficient method that elicits the reasoning abilities of large language models by augmenting the query with examples containing multiple intermediate reasoning steps.
We show that, despite the empirical success of CoT, direct inference without intermediate steps can fail to generalize accurately in cases where CoT succeeds.
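A toy illustration of this query augmentation, with invented example content and formatting:

```python
# Toy chain-of-thought prompt augmentation: the query is prefixed with
# examples whose answers spell out intermediate steps.
# (Example content and format are invented for illustration.)
examples = [
    ("If a = 2 and b = 3, what is a + 2*b?",
     "Step 1: 2*b = 6. Step 2: a + 6 = 8. Answer: 8."),
]
query = "If a = 5 and b = 4, what is a + 2*b?"

prompt = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\n\nQ: {query}\nA:"
print(prompt)  # the model is expected to continue with step-by-step reasoning
```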
arXiv Detail & Related papers (2024-10-03T03:12:51Z)
- Unified ODE Analysis of Smooth Q-Learning Algorithms [5.152147416671501]
Recently, a convergence analysis for Q-learning based on a switching system framework was introduced.
We present a more general and unified convergence analysis that improves upon the switching system approach.
arXiv Detail & Related papers (2024-04-20T01:16:27Z)
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration [86.71396285956044]
This paper provides a theoretical understanding of Deep Q-Networks (DQN) with $\varepsilon$-greedy exploration in deep reinforcement learning.
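For reference, the $\varepsilon$-greedy rule in question is the standard one; a minimal sketch:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    # With probability epsilon, take a uniformly random action;
    # otherwise act greedily w.r.t. the current Q-value estimates.
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit
```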
arXiv Detail & Related papers (2023-10-24T20:37:02Z)
- Federated TD Learning over Finite-Rate Erasure Channels: Linear Speedup under Markovian Sampling [17.870440210358847]
We study a federated policy evaluation problem where agents communicate via a central aggregator to expedite the evaluation of a common policy.
To capture typical communication constraints in federated learning (FL), we consider finite-capacity up-link channels that can drop packets according to a Bernoulli erasure model.
Our work is the first to provide a non-asymptotic analysis of their effects in multi-agent and federated reinforcement learning.
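A schematic sketch of one such communication round, assuming TD(0) with linear value estimates and simple averaging at the aggregator; `federated_td_round` and its parameters are illustrative assumptions, not the paper's protocol:

```python
import numpy as np

def federated_td_round(w, agent_transitions, p_erasure=0.1, alpha=0.05,
                       gamma=0.99, rng=None):
    # w: weight vector of the shared linear value estimate, held centrally.
    # agent_transitions: one (phi_s, r, phi_s_next) feature tuple per agent.
    # (Schematic sketch of the setting, not the paper's exact scheme.)
    rng = rng or np.random.default_rng()
    received = []
    for phi_s, r, phi_s_next in agent_transitions:
        # Local TD(0) update direction for the common policy.
        delta = r + gamma * phi_s_next @ w - phi_s @ w
        update = alpha * delta * phi_s
        # Up-link erasure: the packet is dropped with probability p_erasure.
        if rng.random() >= p_erasure:
            received.append(update)
    # The aggregator averages whatever arrives this round.
    if received:
        w = w + np.mean(received, axis=0)
    return w
```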
arXiv Detail & Related papers (2023-05-14T08:48:02Z)
- Sufficient Exploration for Convex Q-learning [10.75319149461189]
This paper builds on Manne's linear programming (LP) formulation of optimal control.
A primal version is called logistic Q-learning, and a dual variant is convex Q-learning.
It is shown that convex Q-learning is successful in cases where standard Q-learning diverges.
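For context, a standard statement of this LP over discounted state-action occupation measures is shown below (generic textbook notation, not necessarily the paper's):

```latex
\min_{\mu \ge 0} \;\sum_{s,a} \mu(s,a)\, c(s,a)
\quad \text{subject to} \quad
\sum_{a} \mu(s',a) \;=\; \nu(s') + \gamma \sum_{s,a} P(s' \mid s,a)\,\mu(s,a)
\quad \forall s',
```

where $\nu$ is the initial-state distribution and an optimal occupation measure $\mu$ encodes an optimal stationary policy.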
arXiv Detail & Related papers (2022-10-17T20:22:12Z)
- Finite-Time Analysis of Asynchronous Q-learning under Diminishing Step-Size from Control-Theoretic View [3.5823366350053325]
This paper presents a new finite-time analysis of asynchronous Q-learning under Markovian observation models.
In particular, we introduce a discrete-time time-varying switching system model of Q-learning with diminishing step-sizes.
The proposed analysis brings in additional insights, covers different scenarios, and provides new simplified templates for analysis.
arXiv Detail & Related papers (2022-07-25T14:15:55Z)
- Finite-Time Analysis of Asynchronous Q-Learning with Discrete-Time Switching System Models [6.85316573653194]
We prove that Q-learning with a constant step-size can be naturally formulated as a discrete-time switched linear system.
This offers novel and intuitive insights on Q-learning, based mainly on control-theoretic frameworks.
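Schematically, stacking the Q-table into a vector $q_k$, the iterate can be written in the switched affine form below (generic notation; the paper's exact model may differ):

```latex
q_{k+1} \;=\; q_k + \alpha \left( A_{\sigma(q_k)}\, q_k + b_{\sigma(q_k)} \right),
```

where the switching signal $\sigma(q_k)$ indexes the greedy policy induced by the current iterate, so the dynamics switch among finitely many affine subsystems.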
arXiv Detail & Related papers (2021-02-17T05:32:07Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
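For reference, a minimal sketch of one tabular double Q-learning step, in which one estimator selects the greedy action and the other evaluates it (a standard textbook form; the paper's synchronous and asynchronous variants differ in which entries are updated each iteration):

```python
import numpy as np

def double_q_update(qa, qb, s, a, r, s_next, alpha=0.1, gamma=0.99, rng=None):
    # Two tabular estimators: one selects the greedy action, the other
    # evaluates it, which mitigates the overestimation of a single max.
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                       # update qa, evaluate with qb
        a_star = np.argmax(qa[s_next])
        qa[s, a] += alpha * (r + gamma * qb[s_next, a_star] - qa[s, a])
    else:                                        # update qb, evaluate with qa
        b_star = np.argmax(qb[s_next])
        qb[s, a] += alpha * (r + gamma * qa[s_next, b_star] - qb[s, a])
    return qa, qb
```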
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning, maintaining a set of parallel models and estimating the Q-value from a randomly selected network.
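A minimal sketch of this idea with K parallel tabular estimators, where the bootstrap value is read from a randomly selected one; `cross_q_update` and the exact selection rule are illustrative assumptions, not the paper's precise algorithm:

```python
import numpy as np

def cross_q_update(q_tables, i, s, a, r, s_next, alpha=0.1, gamma=0.99,
                   rng=None):
    # Table i is updated with a bootstrap value read from a randomly
    # selected table, generalising double Q-learning to K estimators.
    # (Illustrative sketch; not necessarily the paper's exact rule.)
    rng = rng or np.random.default_rng()
    k = rng.integers(len(q_tables))              # randomly selected evaluator
    a_star = np.argmax(q_tables[i][s_next])      # table i selects the action
    target = r + gamma * q_tables[k][s_next, a_star]
    q_tables[i][s, a] += alpha * (target - q_tables[i][s, a])
    return q_tables
```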
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition [52.08417569774822]
This paper focuses on methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to applications in deep learning (e.g., deep AUC maximization).
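Concretely, these are problems of the form below, where the inner maximisation satisfies a Polyak-Łojasiewicz (PL) condition (a standard statement of the assumption; the paper's exact setting may differ):

```latex
\min_{x} \max_{y} \; f(x, y),
\qquad
\tfrac{1}{2}\,\big\| \nabla_y f(x, y) \big\|^2 \;\ge\; \mu \Big( \max_{y'} f(x, y') - f(x, y) \Big) \quad \forall x, y.
```

When $f(x, \cdot)$ is strongly concave, this PL inequality holds automatically with $\mu$ the strong-concavity parameter.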
arXiv Detail & Related papers (2020-06-12T00:32:21Z)