Convergence and stability of Q-learning in Hierarchical Reinforcement Learning
- URL: http://arxiv.org/abs/2511.17351v1
- Date: Fri, 21 Nov 2025 16:13:53 GMT
- Title: Convergence and stability of Q-learning in Hierarchical Reinforcement Learning
- Authors: Massimiliano Manenti, Andrea Iannelli
- Abstract summary: We propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. We show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game. Experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hierarchical Reinforcement Learning promises, among other benefits, to efficiently capture and utilize the temporal structure of a decision-making problem and to enhance continual learning capabilities, but theoretical guarantees lag behind practice. In this paper, we propose a Feudal Q-learning scheme and investigate under which conditions its coupled updates converge and are stable. By leveraging the theory of Stochastic Approximation and the ODE method, we present a theorem stating the convergence and stability properties of Feudal Q-learning. This provides a principled convergence and stability analysis tailored to Feudal RL. Moreover, we show that the updates converge to a point that can be interpreted as an equilibrium of a suitably defined game, opening the door to game-theoretic approaches to Hierarchical RL. Lastly, experiments based on the Feudal Q-learning algorithm support the outcomes anticipated by theory.
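The paper's code is not reproduced here, but a minimal, hypothetical sketch of what a pair of coupled Feudal Q-learning updates could look like is given below: a manager assigns subgoals and a subgoal-conditioned worker acts to achieve them. All names, array shapes, rewards, and the two step sizes are our assumptions, chosen to mirror the two-timescale stochastic-approximation setting the analysis relies on.

```python
import numpy as np

# Hypothetical sketch, not the authors' implementation: a manager chooses a
# subgoal g for each state; a worker conditioned on g learns to reach it.
n_states, n_goals, n_actions = 10, 4, 3
gamma = 0.95
Q_manager = np.zeros((n_states, n_goals))            # state -> subgoal values
Q_worker = np.zeros((n_goals, n_states, n_actions))  # (subgoal, state) -> action values

def worker_update(g, s, a, r_intrinsic, s_next, alpha_w):
    """One Q-learning step for the worker on its subgoal-conditioned problem."""
    target = r_intrinsic + gamma * Q_worker[g, s_next].max()
    Q_worker[g, s, a] += alpha_w * (target - Q_worker[g, s, a])

def manager_update(s, g, r_extrinsic, s_next, alpha_m):
    """One Q-learning step for the manager over subgoal choices."""
    target = r_extrinsic + gamma * Q_manager[s_next].max()
    Q_manager[s, g] += alpha_m * (target - Q_manager[s, g])
```

The coupling is what makes the analysis delicate: the worker's greedy policy shapes the transitions the manager sees, while the manager's subgoal choice shapes the worker's intrinsic reward, which is why the limit point is naturally read as an equilibrium of a game between the two levels.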
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
- Continual Quantum Architecture Search with Tensor-Train Encoding: Theory and Applications to Signal Processing [68.35481158940401]
CL-QAS is a continual quantum architecture search framework. It mitigates the challenges of costly amplitude encoding and forgetting in variational quantum circuits. It achieves controllable robustness and expressivity, sample-efficient generalization, and smooth convergence without barren plateaus.
arXiv Detail & Related papers (2026-01-10T02:36:03Z)
- Analytic and Variational Stability of Deep Learning Systems [0.0]
We show that uniform boundedness of stability signatures is equivalent to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and subgradient flows.
arXiv Detail & Related papers (2025-12-24T14:43:59Z)
- The Procrustean Bed of Time Series: The Optimization Bias of Point-wise Loss [53.542743390809356]
This paper aims to provide a first-principles analysis of the Expectation of Optimization Bias (EOB). Our analysis reveals a fundamental paradox: the more deterministic and structured the time series, the more severe the bias induced by point-wise loss functions. We present a concrete solution that simultaneously achieves both principles via DFT or DWT.
arXiv Detail & Related papers (2025-12-21T06:08:22Z)
- OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning [12.77713716713937]
We provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators. We derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction.
arXiv Detail & Related papers (2025-11-28T16:09:28Z)
- Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis [7.2620484413601325]
We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning-rate and batch-size schedules. We extend the existing theoretical framework to cover three practical scheduling strategies commonly used in deep learning. Our results reveal a clear hierarchy in convergence: a constant batch size does not guarantee convergence of the expected gradient norm, whereas an increasing batch size does, and simultaneously increasing both the batch size and learning rate achieves a provably faster decay.
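To make that fastest regime concrete, here is a hedged sketch of SGDM in which both the batch size and the learning rate increase over training; the doubling intervals, constants, and the grad_fn interface are our illustrative assumptions, not the paper's.

```python
import numpy as np

def sgdm(grad_fn, w, n_steps, beta=0.9, lr0=0.01, b0=8):
    """grad_fn(w, batch_size) returns a mini-batch gradient estimate at w."""
    m = np.zeros_like(w)
    for t in range(n_steps):
        batch_size = b0 * 2 ** (t // 100)      # assumed: batch size doubles every 100 steps
        lr = lr0 * 2 ** (t // 200)             # assumed: learning rate doubles every 200 steps
        m = beta * m + grad_fn(w, batch_size)  # heavy-ball momentum buffer
        w = w - lr * m
    return w
```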
arXiv Detail & Related papers (2025-08-05T05:32:36Z)
- CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps. We introduce CTRLS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions. We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z)
- Uncertainty quantification for Markov chain induced martingales with application to temporal difference learning [55.197497603087065]
We analyze the performance of the Temporal Difference (TD) learning algorithm with linear function approximation. We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
- On the Convergence and Stability of Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning, and Online Decision Transformers [25.880499561355904]
This article provides a rigorous analysis of convergence and stability of Episodic Upside-Down Reinforcement Learning, Goal-Conditioned Supervised Learning and Online Decision Transformers.
arXiv Detail & Related papers (2025-02-08T19:26:22Z)
- AI Explainability for Power Electronics: From a Lipschitz Continuity Perspective [2.2827888408068624]
This letter proposes a generic framework to evaluate mathematical explainability. Inference stability governs consistent outputs under input perturbations, essential for robust real-time control and fault diagnosis. A Lipschitz-aware learning rate selection strategy is introduced to accelerate convergence while mitigating overshoots and oscillations.
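As a purely hypothetical illustration of a Lipschitz-aware step-size rule (the letter's exact procedure is not reproduced here), one can upper-bound a feedforward network's Lipschitz constant by the product of its layers' spectral norms and scale the learning rate inversely:

```python
import numpy as np

def lipschitz_aware_lr(weight_matrices, c=1.0):
    """Illustrative only: lr ~ c / L, with L an upper bound on the network's
    Lipschitz constant (product of layer spectral norms, assuming 1-Lipschitz
    activations). The constant c is a tunable assumption of ours."""
    L = 1.0
    for W in weight_matrices:
        L *= np.linalg.norm(W, 2)  # spectral norm = largest singular value
    return c / L
```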
arXiv Detail & Related papers (2025-01-17T04:20:43Z)
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [55.80276145563105]
We investigate the statistical properties of Temporal Difference learning with Polyak-Ruppert averaging. We make three theoretical contributions that improve upon the current state-of-the-art results.
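For reference, the estimator under study is standard TD(0) with linear function approximation combined with Polyak-Ruppert averaging; the sketch below fixes an illustrative step size and feature interface of our choosing.

```python
import numpy as np

def td0_polyak(trajectory, phi, d, gamma=0.99, alpha=0.05):
    """trajectory: iterable of (s, r, s_next); phi(s): feature vector in R^d."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t, (s, r, s_next) in enumerate(trajectory, start=1):
        td_error = r + gamma * phi(s_next) @ theta - phi(s) @ theta
        theta += alpha * td_error * phi(s)           # TD(0) step on the parameters
        theta_bar += (theta - theta_bar) / t         # running Polyak-Ruppert average
    return theta_bar
```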
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments [1.90365714903665]
We present a convergence theorem for stochastic iterations, and in particular for Q-learning iterates, under a general, possibly non-Markovian, environment.
We discuss the implications and applications of this theorem to a variety of control problems with non-Markovian environments.
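The iterate in question is the classical tabular Q-learning update; the sketch below (step size and interface are illustrative choices of ours) shows the recursion whose convergence the entry extends beyond Markovian environments.

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma=0.99):
    """One tabular Q-learning update; Q is an (n_states, n_actions) array."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```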
arXiv Detail & Related papers (2023-10-31T19:53:16Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- An Analysis of Quantile Temporal-Difference Learning [53.36758478669685]
Quantile temporal-difference learning (QTD) has proven to be a key component in several successful large-scale applications of reinforcement learning.
Unlike classical TD learning, QTD updates do not approximate contraction mappings, are highly non-linear, and may have multiple fixed points.
This paper proves convergence, with probability 1, to the fixed points of a related family of dynamic programming procedures.
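For concreteness, a minimal sketch of a QTD update at a single transition is given below (the quantile midpoints, step size, and array layout are our assumptions); the sign-based quantile-regression increments make the update highly non-linear, in line with the entry's point that QTD updates are not contraction mappings.

```python
import numpy as np

def qtd_update(theta, s, r, s_next, alpha=0.05, gamma=0.99):
    """theta: (n_states, N) array of per-state quantile estimates."""
    N = theta.shape[1]
    tau = (np.arange(N) + 0.5) / N        # quantile midpoints tau_i = (2i - 1) / (2N)
    targets = r + gamma * theta[s_next]   # bootstrap target per next-state quantile
    for i in range(N):
        # quantile-regression step: move theta_i up or down depending on how
        # much target mass sits below it, relative to the fraction tau_i
        indicator = (targets < theta[s, i]).mean()
        theta[s, i] += alpha * (tau[i] - indicator)
    return theta
```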
arXiv Detail & Related papers (2023-01-11T13:41:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.