Convergence of regularized agent-state-based Q-learning in POMDPs
- URL: http://arxiv.org/abs/2508.21314v2
- Date: Tue, 02 Sep 2025 23:21:19 GMT
- Title: Convergence of regularized agent-state-based Q-learning in POMDPs
- Authors: Amit Sinha, Matthieu Geist, Aditya Mahajan
- Abstract summary: We investigate the simplest form of such Q-learning algorithms, which we call regularized agent-state-based Q-learning (RASQL). We show that it converges under mild technical conditions. We show that a similar analysis continues to work for a variant of RASQL that learns periodic policies.
- Score: 24.164262011028246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavior policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches the proposed theoretical limit.
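To make the RASQL update concrete, here is a minimal tabular sketch. It assumes entropy regularization with temperature `tau` (so the regularized Bellman backup is a log-sum-exp, computed stably) and takes the next agent state `z_next` as given from some fixed agent-state update function; both choices are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def soft_backup(q_row, tau):
    """Entropy-regularized backup: tau * log sum_a exp(q_row[a] / tau), computed stably."""
    m = q_row.max()
    return m + tau * np.log(np.exp((q_row - m) / tau).sum())

def rasql_update(Q, z, a, r, z_next, alpha, gamma, tau):
    """One tabular RASQL step; Q is a (num_agent_states, num_actions) array."""
    target = r + gamma * soft_backup(Q[z_next], tau)
    Q[z, a] += alpha * (target - Q[z, a])
    return Q
```

In this sketch the behavior policy would typically be the matching Boltzmann policy, with pi(a|z) proportional to exp(Q[z, a] / tau); per the abstract, the stationary distribution induced by the behavior policy determines which fixed point RASQL converges to.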
Related papers
- Periodic Regularized Q-Learning [9.333190920811626]
We propose a new algorithm, periodic regularized Q-learning (PRQ). We provide a rigorous theoretical analysis that proves finite-time convergence guarantees for PRQ under linear function approximation.
arXiv Detail & Related papers (2026-02-03T09:28:06Z) - Periodic agent-state based Q-learning for POMDPs [23.296159073116264]
A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history.
We propose PASQL (periodic agent-state-based Q-learning), which is a variant of agent-state-based Q-learning that learns periodic policies.
By combining ideas from periodic Markov chains and stochastic approximation, we rigorously establish that PASQL converges to a cyclic limit and characterize the approximation error of the periodic policy.
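A hypothetical tabular rendering of the idea (not PASQL's exact pseudocode): keep one Q-table per phase of a period-L cycle and update them in rotation, so the greedy policy may differ across phases.

```python
def pasql_update(Q_tables, t, L, z, a, r, z_next, alpha, gamma):
    """One update of a period-L family of Q-tables (list of 2-D numpy arrays)."""
    k, k_next = t % L, (t + 1) % L                       # current and next phase
    target = r + gamma * Q_tables[k_next][z_next].max()  # bootstrap from the next phase
    Q_tables[k][z, a] += alpha * (target - Q_tables[k][z, a])
```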
arXiv Detail & Related papers (2024-07-08T16:58:57Z) - Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner.
One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance.
We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
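The summary names the weakness (uncontrolled estimation bias) but not 2RA's estimator. Purely as an illustration of one standard remedy, and not 2RA's actual construction, one can back up against an ensemble-averaged Q-table:

```python
import numpy as np

def averaged_target(Q_ensemble, s_next, r, gamma):
    """Bootstrap target from the mean of an ensemble of Q-tables (illustration only)."""
    Q_bar = np.mean(Q_ensemble, axis=0)  # average over ensemble members
    return r + gamma * Q_bar[s_next].max()
```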
arXiv Detail & Related papers (2024-05-03T15:57:26Z) - Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
arXiv Detail & Related papers (2023-07-25T21:38:08Z) - q-Learning in Continuous Time [11.694169299062597]
We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation. We develop a "q-learning" theory around the q-function that is independent of time discretization. We devise different actor-critic algorithms for solving the underlying RL problems.
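For context, in standard entropy-regularized exploratory formulations the optimal stochastic policy is a Gibbs (soft-max) measure in the q-function. A sketch of that relation, with the temperature written as $\gamma$ by assumption:

```latex
\pi^*(a \mid s)
  = \frac{\exp\big(q^*(s,a)/\gamma\big)}
         {\int_{\mathcal{A}} \exp\big(q^*(s,a')/\gamma\big)\,\mathrm{d}a'}
```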
arXiv Detail & Related papers (2022-07-02T02:20:41Z) - Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
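For reference, the mechanism under study is the temporal difference update; in its simplest tabular TD(0) form:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```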
arXiv Detail & Related papers (2022-06-05T08:49:16Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
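A minimal sketch of Q-learning with linear function approximation, where Q(s, a) = phi(s, a)^T w for a user-supplied feature map phi; the paper's specific exploration scheme is not reproduced here:

```python
def linear_q_update(w, phi, s, a, r, s_next, actions, alpha, gamma):
    """One semi-gradient Q-learning step with linear features (phi returns a numpy vector)."""
    q_next = max(phi(s_next, b) @ w for b in actions)  # greedy bootstrap
    td_error = r + gamma * q_next - phi(s, a) @ w
    return w + alpha * td_error * phi(s, a)
```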
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - A Subgame Perfect Equilibrium Reinforcement Learning Approach to Time-inconsistent Problems [4.314956204483074]
We establish a subgame perfect equilibrium reinforcement learning framework for time-inconsistent (TIC) problems.
We propose a new class of algorithms, called backward policy iteration (BPI), that solves SPERL and addresses both challenges.
To demonstrate the practical usage of BPI as a training framework, we adapt standard RL simulation methods and derive two BPI-based training algorithms.
arXiv Detail & Related papers (2021-10-27T09:21:35Z) - Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
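Reverse experience replay, the key ingredient in Q-Rex, replays stored transitions in reverse collection order so that bootstrapped values propagate quickly back along a trajectory. A minimal tabular sketch (the target-network component of Q-Rex is omitted):

```python
def rer_sweep(Q, trajectory, alpha, gamma):
    """Replay one stored trajectory of (s, a, r, s_next) tuples in reverse order."""
    for (s, a, r, s_next) in reversed(trajectory):
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
```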
arXiv Detail & Related papers (2021-10-16T01:47:41Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
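A sketch of a single generalized policy mirror descent step, in the usual regularized form (notation assumed: regularizer $h$, Bregman divergence $D$, stepsize $\eta$):

```latex
\pi^{(k+1)}(\cdot \mid s)
  = \arg\max_{\pi \in \Delta(\mathcal{A})}
    \Big\langle Q^{(k)}(s, \cdot),\, \pi \Big\rangle
    - h(\pi)
    - \tfrac{1}{\eta}\, D\big(\pi,\, \pi^{(k)}(\cdot \mid s)\big)
```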
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased estimates of the gradient of the value function, which we use as ascent directions to update the policy.
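The ascent directions referred to above are instances of the classical policy gradient theorem; in its familiar discounted form (a sketch, not the paper's non-stationary estimator):

```latex
\nabla_\theta V(\pi_\theta)
  = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q^{\pi_\theta}(s_t, a_t)\Big]
```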
arXiv Detail & Related papers (2020-10-16T15:15:42Z)