Q-Learning for MDPs with General Spaces: Convergence and Near Optimality
via Quantization under Weak Continuity
- URL: http://arxiv.org/abs/2111.06781v3
- Date: Thu, 7 Sep 2023 17:42:54 GMT
- Title: Q-Learning for MDPs with General Spaces: Convergence and Near Optimality
via Quantization under Weak Continuity
- Authors: Ali Devran Kara, Naci Saldi, Serdar Yüksel
- Abstract summary: We show that Q-learning for standard Borel MDPs via quantization of states and actions converges to a limit.
Our paper presents a very general convergence and approximation result for the applicability of Q-learning for continuous MDPs.
- Score: 2.685668802278156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning algorithms often require finiteness of state and
action spaces in Markov decision processes (MDPs) (also called controlled
Markov chains), and various efforts have been made in the literature toward
extending such algorithms to continuous state and action spaces. In this
paper, we show that under very mild regularity conditions (in particular,
involving only weak continuity of the transition kernel of an MDP), Q-learning
for standard Borel MDPs via quantization of states and actions (called
Quantized Q-Learning) converges to a limit, and furthermore that this limit
satisfies an optimality equation which leads to near-optimal policies, either
with explicit performance bounds or with guaranteed asymptotic optimality. Our
approach builds on (i) viewing quantization as a measurement kernel, and thus
the quantized MDP as a partially observed Markov decision process (POMDP),
(ii) utilizing near-optimality and convergence results of Q-learning for
POMDPs, and (iii) the near-optimality of finite state model approximations for
MDPs with weakly continuous kernels, which we show to correspond to the fixed
point of the constructed POMDP. Thus, our paper presents a very general
convergence and approximation result for the applicability of Q-learning for
continuous MDPs.
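Concretely, the quantized Q-learning scheme described in the abstract amounts to running tabular Q-learning on the indices produced by state and action quantizers. The Python sketch below is a minimal illustration under simplifying assumptions, not the authors' implementation: the environment interface (reset/step), the uniform bin quantizer, the bounded state interval, and all hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

# Quantization parameters -- all values here are illustrative assumptions,
# not taken from the paper.
N_STATE_BINS, N_ACTION_BINS = 50, 10
STATE_LOW, STATE_HIGH = -1.0, 1.0                     # assumed bounded state interval
ACTION_GRID = np.linspace(-1.0, 1.0, N_ACTION_BINS)   # finite set of representative actions
GAMMA = 0.95                                          # discount factor

def quantize_state(x):
    """Map a continuous state to its quantization bin index (the 'measurement')."""
    idx = int((x - STATE_LOW) / (STATE_HIGH - STATE_LOW) * N_STATE_BINS)
    return min(max(idx, 0), N_STATE_BINS - 1)

def quantized_q_learning(env, n_steps=200_000, eps=0.1, seed=0):
    """Run tabular Q-learning on the quantized (finite) state/action model.

    `env` is assumed to expose `reset() -> x0` and `step(u) -> (x_next, reward)`
    for a continuous state/action MDP.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATE_BINS, N_ACTION_BINS))
    visits = np.zeros_like(Q)
    s = quantize_state(env.reset())
    for _ in range(n_steps):
        # Epsilon-greedy exploration over the finite quantized action set.
        a = rng.integers(N_ACTION_BINS) if rng.random() < eps else int(Q[s].argmax())
        x_next, reward = env.step(ACTION_GRID[a])      # apply the representative action
        s_next = quantize_state(x_next)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                     # vanishing step size per (s, a) visit count
        # Standard tabular Q-learning update on quantized states/actions.
        Q[s, a] += alpha * (reward + GAMMA * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```

Under the paper's conditions, the iterates of such a scheme converge, and the limit satisfies the optimality equation of the associated finite (quantized) model; the greedy policy extracted from the limiting Q table, extended to the original state space through the quantizer, is then near optimal, with the approximation improving as the quantization is refined.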
Related papers
- Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis [30.713243690224207]
In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes.
This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees.
arXiv Detail & Related papers (2024-10-31T16:53:20Z)
- Quantum Markov Decision Processes: General Theory, Approximations, and Classes of Policies [1.8775413720750924]
We present a novel quantum MDP model aiming to introduce a new framework, algorithms, and future research avenues.
We hope that our approach will pave the way for a new research direction in discrete-time quantum control.
arXiv Detail & Related papers (2024-02-22T15:59:09Z)
- Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments [1.90365714903665]
We present a convergence theorem for stochastic iterations, and in particular for Q-learning iterates, under a general, possibly non-Markovian, environment.
We discuss the implications and applications of this theorem to a variety of control problems with non-Markovian environments.
arXiv Detail & Related papers (2023-10-31T19:53:16Z)
- Optimality Guarantees for Particle Belief Approximation of POMDPs [55.83001584645448]
Partially observable Markov decision processes (POMDPs) provide a flexible representation for real-world decision and control problems.
POMDPs are notoriously difficult to solve, especially when the state and observation spaces are continuous or hybrid.
We propose a theory characterizing the approximation error of the particle filtering techniques that these algorithms use.
arXiv Detail & Related papers (2022-10-10T21:11:55Z)
- Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability [0.0]
For POMDPs, we establish the convergence of a Q-learning algorithm for control policies that use a finite history of past observations and control actions.
We present explicit error bounds relating the approximation error to the length of the finite history window.
We show that the limit fixed point equation gives an optimal solution for an approximate belief-MDP.
arXiv Detail & Related papers (2021-03-22T20:14:26Z)
- Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes [0.0]
We study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known.
We find optimal policies for the approximate belief model under mild non-linear filter stability conditions.
We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound.
arXiv Detail & Related papers (2020-10-15T00:37:51Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
- A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm [59.99439951055238]
We show how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based.
The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM).
We show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C approach.
arXiv Detail & Related papers (2020-06-23T01:34:18Z)
- Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition [52.08417569774822]
This paper focuses on methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their applications in deep learning (e.g., deep AUC maximization).
arXiv Detail & Related papers (2020-06-12T00:32:21Z)
- Planning in Markov Decision Processes with Gap-Dependent Sample Complexity [48.98199700043158]
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process.
We prove an upper bound on the number of calls to the generative models needed for MDP-GapE to identify a near-optimal action with high probability.
arXiv Detail & Related papers (2020-06-10T15:05:51Z)
- Theoretical Convergence of Multi-Step Model-Agnostic Meta-Learning [63.64636047748605]
We develop a new theoretical framework to provide convergence guarantee for the general multi-step MAML algorithm.
In particular, our results suggest that the inner-stage step size needs to be chosen inversely proportional to the number $N$ of inner-stage steps in order for $N$-step MAML to have guaranteed convergence.
arXiv Detail & Related papers (2020-02-18T19:17:54Z)