Convergence and stability of Q-learning in Hierarchical Reinforcement Learning
- URL: http://arxiv.org/abs/2511.17351v1
- Date: Fri, 21 Nov 2025 16:13:53 GMT
- Title: Convergence and stability of Q-learning in Hierarchical Reinforcement Learning
- Authors: Massimiliano Manenti, Andrea Iannelli
- Abstract summary: We propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. We show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game. Experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
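The paper's code is not reproduced here, but a minimal, hypothetical sketch of what a pair of coupled Feudal Q-learning updates could look like is given below: a manager assigns subgoals and a subgoal-conditioned worker acts to achieve them. All names, array shapes, rewards, and the two step sizes are our assumptions, chosen to mirror the two-timescale stochastic-approximation setting the analysis relies on.

```python
import numpy as np

# Hypothetical sketch, not the authors' implementation: a manager chooses a
# subgoal g for each state; a worker conditioned on g learns to reach it.
n_states, n_goals, n_actions = 10, 4, 3
gamma = 0.95
Q_manager = np.zeros((n_states, n_goals))            # state -> subgoal values
Q_worker = np.zeros((n_goals, n_states, n_actions))  # (subgoal, state) -> action values

def worker_update(g, s, a, r_intrinsic, s_next, alpha_w):
    """One Q-learning step for the worker on its subgoal-conditioned problem."""
    target = r_intrinsic + gamma * Q_worker[g, s_next].max()
    Q_worker[g, s, a] += alpha_w * (target - Q_worker[g, s, a])

def manager_update(s, g, r_extrinsic, s_next, alpha_m):
    """One Q-learning step for the manager over subgoal choices."""
    target = r_extrinsic + gamma * Q_manager[s_next].max()
    Q_manager[s, g] += alpha_m * (target - Q_manager[s, g])
```

The coupling is what makes the analysis delicate: the worker's greedy policy shapes the transitions the manager sees, while the manager's subgoal choice shapes the worker's intrinsic reward, which is why the limit point is naturally read as an equilibrium of a game between the two levels.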
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
- Continual Quantum Architecture Search with Tensor-Train Encoding: Theory and Applications to Signal Processing [68.35481158940401]
CL-QAS is a continual quantum architecture search framework. It mitigates the challenges of costly amplitude encoding and forgetting in variational quantum circuits. It achieves controllable robustness and expressivity, sample-efficient generalization, and smooth convergence without barren plateaus.
arXiv Detail & Related papers (2026-01-10T02:36:03Z)
- Analytic and Variational Stability of Deep Learning Systems [0.0]
We show that uniform boundedness of stability signatures is equivalent to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and subgradient flows.
arXiv Detail & Related papers (2025-12-24T14:43:59Z)
- The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradox: the more deterministic and structured the time series, the more severe the bias induced by point-wise loss functions. We present a concrete solution that simultaneously achieves both principles via DFT or DWT.
arXiv Detail & Related papers (2025-12-21T06:08:22Z)
- OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning [12.77713716713937]
We provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators. We derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction.
arXiv Detail & Related papers (2025-11-28T16:09:28Z)
- Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis [7.2620484413601325]
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay.
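To make that fastest regime concrete, here is a hedged sketch of SGDM in which both the batch size and the learning rate increase over training; the doubling intervals, constants, and the grad_fn interface are our illustrative assumptions, not the paper's.

```python
import numpy as np

def sgdm(grad_fn, w, n_steps, beta=0.9, lr0=0.01, b0=8):
    """grad_fn(w, batch_size) returns a mini-batch gradient estimate at w."""
    m = np.zeros_like(w)
    for t in range(n_steps):
        batch_size = b0 * 2 ** (t // 100)      # assumed: batch size doubles every 100 steps
        lr = lr0 * 2 ** (t // 200)             # assumed: learning rate doubles every 200 steps
        m = beta * m + grad_fn(w, batch_size)  # heavy-ball momentum buffer
        w = w - lr * m
    return w
```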
arXiv Detail & Related papers (2025-08-05T05:32:36Z)
- CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z)
- Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning [55.197497603087065]
We analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximation. We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
- On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers [25.880499561355904]
This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers.
arXiv Detail & Related papers (2025-02-08T19:26:22Z)
- AI Explainability for Power Electronics: From a Lipschitz Continuity Perspective [2.2827888408068624]
This letter proposes a generic framework to evaluate mathematical explainability. Inference stability governs consistent outputs under input perturbations, essential for robust real-time control and fault diagnosis. A Lipschitz-aware learning rate selection strategy is introduced to accelerate convergence while mitigating overshoots and oscillations.
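As a purely hypothetical illustration of a Lipschitz-aware step-size rule (the letter's exact procedure is not reproduced here), one can upper-bound a feedforward network's Lipschitz constant by the product of its layers' spectral norms and scale the learning rate inversely:

```python
import numpy as np

def lipschitz_aware_lr(weight_matrices, c=1.0):
    """Illustrative only: lr ~ c / L, with L an upper bound on the network's
    Lipschitz constant (product of layer spectral norms, assuming 1-Lipschitz
    activations). The constant c is a tunable assumption of ours."""
    L = 1.0
    for W in weight_matrices:
        L *= np.linalg.norm(W, 2)  # spectral norm = largest singular value
    return c / L
```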
arXiv Detail & Related papers (2025-01-17T04:20:43Z)
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [55.80276145563105]
We investigate the statistical properties of Temporal Difference learning with Polyak-Ruppert averaging. We make three theoretical contributions that improve upon the current state-of-the-art results.
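For reference, the estimator under study is standard TD(0) with linear function approximation combined with Polyak-Ruppert averaging; the sketch below fixes an illustrative step size and feature interface of our choosing.

```python
import numpy as np

def td0_polyak(trajectory, phi, d, gamma=0.99, alpha=0.05):
    """trajectory: iterable of (s, r, s_next); phi(s): feature vector in R^d."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t, (s, r, s_next) in enumerate(trajectory, start=1):
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * td_error * phi(s)           # TD(0) step on the parameters
        theta_bar += (theta - theta_bar) / t         # running Polyak-Ruppert average
    return theta_bar
```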
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments [1.90365714903665]
We present a convergence theorem for stochastic iterations, and in particular for Q-learning iterates, under a general, possibly non-Markovian, environment.
We discuss the implications and applications of this theorem to a variety of control problems with non-Markovian environments.
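The iterate in question is the classical tabular Q-learning update; the sketch below (step size and interface are illustrative choices of ours) shows the recursion whose convergence the entry extends beyond Markovian environments.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma=0.99):
    """One tabular Q-learning update; Q is an (n_states, n_actions) array."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```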
arXiv Detail & Related papers (2023-10-31T19:53:16Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- An Analysis of Quantile Temporal-Difference Learning [53.36758478669685]
Quantile temporal-difference learning (QTD) has proven to be a key component in several successful large-scale applications of reinforcement learning.
Unlike classical TD learning, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points.
This paper proves convergence, with probability 1, to the fixed points of a related family of dynamic programming procedures.
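For concreteness, a minimal sketch of a QTD update at a single transition is given below (the quantile midpoints, step size, and array layout are our assumptions); the sign-based quantile-regression increments make the update highly non-linear, in line with the entry's point that QTD updates are not contraction mappings.

```python
import numpy as np

def qtd_update(theta, s, r, s_next, alpha=0.05, gamma=0.99):
    """theta: (n_states, N) array of per-state quantile estimates."""
    N = theta.shape[1]
    tau = (np.arange(N) + 0.5) / N        # quantile midpoints tau_i = (2i - 1) / (2N)
    targets = r + gamma * theta[s_next]   # bootstrap target per next-state quantile
    for i in range(N):
        # quantile-regression step: move theta_i up or down depending on how
        # much target mass sits below it, relative to the fraction tau_i
        indicator = (targets < theta[s, i]).mean()
        theta[s, i] += alpha * (tau[i] - indicator)
    return theta
```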
arXiv Detail & Related papers (2023-01-11T13:41:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.