Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow
- URL: http://arxiv.org/abs/2602.14587v1
- Date: Mon, 16 Feb 2026 09:35:25 GMT
- Title: Decoupled Continuous-Time Reinforcement Learning via Hamiltonian Flow
- Authors: Minh Nguyen
- Abstract summary: Real-world control problems evolve in continuous time with non-uniform, event-driven decisions.
As time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking.
Existing continuous-time methods reintroduce action information via an advantage-rate function $q$.
We propose a novel decoupled continuous-time actor-critic algorithm with alternating updates.
- Score: 1.8824572526199168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many real-world control problems, ranging from finance to robotics, evolve in continuous time with non-uniform, event-driven decisions. Standard discrete-time reinforcement learning (RL), based on fixed-step Bellman updates, struggles in this setting: as time gaps shrink, the $Q$-function collapses to the value function $V$, eliminating action ranking. Existing continuous-time methods reintroduce action information via an advantage-rate function $q$. However, they enforce optimality through complicated martingale losses or orthogonality constraints, which are sensitive to the choice of test processes. These approaches entangle $V$ and $q$ into a large, complex optimization problem that is difficult to train reliably. To address these limitations, we propose a novel decoupled continuous-time actor-critic algorithm with alternating updates: $q$ is learned from diffusion generators on $V$, and $V$ is updated via a Hamiltonian-based value flow that remains informative under infinitesimal time steps, where standard max/softmax backups fail. Theoretically, we prove rigorous convergence via new probabilistic arguments, sidestepping the challenge that generator-based Hamiltonians lack Bellman-style contraction under the sup-norm. Empirically, our method outperforms prior continuous-time and leading discrete-time baselines across challenging continuous-control benchmarks and a real-world trading task, achieving 21% profit over a single quarter, nearly doubling the second-best method.
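For intuition, below is a minimal, purely illustrative sketch of the alternating update described in the abstract, under stated assumptions: the advantage rate $q$ is obtained from a generator-style residual of $V$ (here a finite-difference surrogate over a small step $dt$), and $V$ is then relaxed along a Hamiltonian-style flow that is stationary exactly where $\max_a q(s,a) = 0$. The toy tabular dynamics, hyperparameters, and variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 3
dt, beta, eta = 0.05, 0.1, 1.0          # time step, discount rate, flow step size

# Toy controlled Markov dynamics: transition kernel P[a, s, s'] and reward rate r[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
r = rng.normal(size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(2000):
    # q-step: advantage rate from a generator-style residual of V,
    #   q(s, a) = r(s, a) + (E[V(s_{t+dt}) | s, a] - V(s)) / dt - beta * V(s)
    EV = np.einsum("asz,z->sa", P, V)              # E[V(s')] for every (s, a)
    q = r + (EV - V[:, None]) / dt - beta * V[:, None]

    # V-step: Hamiltonian-style value flow; at the optimum max_a q(s, a) = 0,
    # so the flow is stationary exactly at the optimal value function.
    V = V + eta * dt * q.max(axis=1)

pi = q.argmax(axis=1)                               # greedy policy induced by q
print("V:", np.round(V, 3), "greedy actions:", pi)
```

The two steps never share a single entangled objective: $q$ is recomputed from the current $V$, and $V$ moves only along the resulting Hamiltonian, which mirrors the decoupling the abstract emphasizes.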
Related papers
- Online Markov Decision Processes with Terminal Law Constraints [10.878763806286157]
We introduce a reset-free framework called the periodic framework.
The goal is to find periodic policies that minimize cumulative loss and return the agents to their initial state distribution after a fixed number of steps.
We give the first algorithms for computing periodic policies in two multi-agent settings and show that they achieve sublinear periodic regret of order $\tilde{O}(T^{3/4})$.
This provides the first non-asymptotic guarantees for reset-free learning in the setting of $M$ homogeneous agents, for $M > 1$.
arXiv Detail & Related papers (2026-01-12T12:46:12Z) - Learning and certification of local time-dependent quantum dynamics and noise [5.1798081822960365]
Hamiltonian learning protocols are essential tools to benchmark quantum computers and simulators.
We learn the time-dependent evolution of a locally interacting $n$-qubit system on a graph of effective dimension $D$.
Our protocol outputs functions approximating the coefficients to accuracy $\epsilon$ on an interval with success probability $1-\delta$.
arXiv Detail & Related papers (2025-10-09T17:39:40Z) - From Continual Learning to SGD and Back: Better Rates for Continual Linear Models [50.11453013647086]
We analyze the forgetting, i.e., loss on previously seen tasks, after $k$ iterations.
We develop novel last-iterate upper bounds in the realizable least squares setup.
We prove for the first time that randomization alone, with no task repetition, can prevent catastrophic forgetting in sufficiently long task sequences.
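A minimal sketch of the continual linear regression setting referenced above, assuming the standard realizable setup (a shared $w^\star$ fits every task) and a projection-style update that moves the current iterate to the nearest exact solution of each new task; the dimensions, task sizes, and update rule here are illustrative assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per_task, n_tasks = 20, 5, 30
w_star = rng.normal(size=d)

tasks = []
w = np.zeros(d)
for t in range(n_tasks):
    X = rng.normal(size=(n_per_task, d))
    y = X @ w_star                                  # realizable: no label noise
    tasks.append((X, y))

    # Projection update: w <- w + X^+ (y - X w), the least-norm correction that
    # makes w fit the new task exactly (project w onto {v : X v = y}).
    w = w + np.linalg.pinv(X) @ (y - X @ w)

    # Forgetting: average squared loss over all tasks seen so far.
    forgetting = np.mean([np.mean((Xi @ w - yi) ** 2) for Xi, yi in tasks])
    if (t + 1) % 10 == 0:
        print(f"after task {t + 1}: forgetting = {forgetting:.4f}")
```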
arXiv Detail & Related papers (2025-04-06T18:39:45Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Near-continuous time Reinforcement Learning for continuous state-action spaces [3.5527561584422456]
We consider the Reinforcement Learning problem of controlling an unknown dynamical system to maximise the long-term average reward along a single trajectory.
Most of the literature considers system interactions that occur in discrete time and discrete state-action spaces.
We show that the celebrated optimism protocol applies when the sub-tasks (learning and planning) can be performed effectively.
arXiv Detail & Related papers (2023-09-06T08:01:17Z) - Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
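A rough, hedged sketch of the Thompson-sampling-with-episodes idea mentioned above, specialized to a small tabular MDP with known rewards and a Dirichlet posterior over transitions; the episode-sizing rule used here (lengths growing with the episode index) and all other details are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 2, 0.9
P_true = rng.dirichlet(np.ones(nS), size=(nS, nA))     # true transitions P[s, a, s']
R = rng.uniform(size=(nS, nA))                          # known reward table

counts = np.ones((nS, nA, nS))                          # Dirichlet(1, ..., 1) prior
s = 0
for k in range(1, 30):                                  # episode index
    # Thompson sampling: draw one MDP from the posterior over transition kernels.
    P_hat = np.array([[rng.dirichlet(counts[si, ai]) for ai in range(nA)]
                      for si in range(nS)])

    # Plan in the sampled MDP with a few rounds of value iteration.
    V = np.zeros(nS)
    for _ in range(200):
        Q = R + gamma * P_hat @ V                       # Q[s, a]
        V = Q.max(axis=1)
    policy = Q.argmax(axis=1)

    # Dynamically-sized episode: act for k steps, then re-sample the posterior.
    for _ in range(k):
        a = policy[s]
        s_next = rng.choice(nS, p=P_true[s, a])
        counts[s, a, s_next] += 1                       # posterior update
        s = s_next
```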
arXiv Detail & Related papers (2023-06-05T03:57:16Z) - Non-stationary Delayed Online Convex Optimization: From Full-information to Bandit Setting [71.82716109461967]
We propose an algorithm called Mild-OGD for the full-information case, where delayed gradients are available.
We show that the dynamic regret of Mild-OGD can be automatically bounded by $O(\sqrt{\bar{d}T}(P_T+1))$ under the in-order assumption.
We also develop a bandit variant of Mild-OGD for a more challenging case with only delayed loss values.
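To make the delayed-feedback setting concrete, here is a hedged sketch of plain online gradient descent when gradients arrive after a delay (the primitive behind the full-information case above); it is not Mild-OGD itself, whose meta-expert structure the summary does not detail, and the loss functions, delays, and step size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, d, eta = 50, 3, 0.1
x = np.zeros(d)

targets = rng.normal(size=(T, d))                   # round-t loss f_t(x) = ||x - targets[t]||^2 / 2
delays = rng.integers(1, 5, size=T)                 # gradient of round t arrives at t + delays[t]
pending = {}                                        # arrival time -> list of delayed gradients

for t in range(T):
    # Play x_t and suffer f_t(x_t); the gradient is only revealed after the delay.
    grad_t = x - targets[t]
    pending.setdefault(t + int(delays[t]), []).append(grad_t)

    # Apply every gradient that has arrived by now (possibly several, possibly none).
    for g in pending.pop(t, []):
        x = x - eta * g

print("final decision:", np.round(x, 3))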
arXiv Detail & Related papers (2023-05-20T07:54:07Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Acting in Delayed Environments with Non-Stationary Markov Policies [57.52103323209643]
We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps.
We prove that with execution delay, deterministic Markov policies in the original state-space are sufficient for attaining maximal reward, but need to be non-stationary.
We devise a non-stationary Q-learning style model-based algorithm that solves delayed execution tasks without resorting to state-augmentation.
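As a concrete illustration of the execution-delay setting (not of the paper's algorithm), the sketch below shows the interaction protocol: an action committed at time $t$ only reaches the environment at time $t+m$, so the agent always acts through an $m$-step action buffer; the toy environment, placeholder policy, and delay value are illustrative assumptions.

```python
from collections import deque
import random

random.seed(0)
m = 3                                               # execution delay in steps
n_states, n_actions = 5, 2

def env_step(state, action):
    """Toy environment used only for illustration: random-walk dynamics and a goal reward."""
    next_state = (state + (1 if action == 1 else -1)) % n_states
    return next_state, (1.0 if next_state == 0 else 0.0)

state, total = 0, 0.0
pending = deque(random.randrange(n_actions) for _ in range(m))   # actions already committed

for t in range(100):
    # Commit an action now; it will only reach the environment m steps from now.
    pending.append(random.randrange(n_actions))     # placeholder policy (the paper learns a
                                                    # non-stationary one, without state augmentation)
    executed = pending.popleft()                    # the action chosen m steps ago is executed
    state, reward = env_step(state, executed)
    total += reward

print("return over 100 steps:", total)
```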
arXiv Detail & Related papers (2021-01-28T13:35:37Z) - Accelerated Learning with Robustness to Adversarial Regressors [1.0499611180329802]
We propose a new discrete time algorithm which provides stability and convergence guarantees in the presence of adversarial regressors.
In particular, our algorithm reaches an $\epsilon$-suboptimal point in at most $\tilde{\mathcal{O}}(1/\sqrt{\epsilon})$ iterations when regressors are constant.
arXiv Detail & Related papers (2020-05-04T14:42:49Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)