Continuous-time reinforcement learning for optimal switching over multiple regimes
- URL: http://arxiv.org/abs/2512.04697v1
- Date: Thu, 04 Dec 2025 11:48:07 GMT
- Title: Continuous-time reinforcement learning for optimal switching over multiple regimes
- Authors: Yijie Huang, Mengge Li, Xiang Yu, Zhou Zhou
- Abstract summary: This paper studies continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman equations and provide a characterization of the optimal policy. A reinforcement learning algorithm is devised and implemented by invoking the policy evaluation based on the martingale characterization.
- Score: 5.045537244224327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies continuous-time reinforcement learning (RL) for optimal switching problems across multiple regimes. We consider a type of exploratory formulation under entropy regularization where the agent randomizes both the timing of switches and the selection of regimes through the generator matrix of an associated continuous-time finite-state Markov chain. We establish the well-posedness of the associated system of Hamilton-Jacobi-Bellman (HJB) equations and provide a characterization of the optimal policy. The policy improvement and the convergence of the policy iterations are rigorously established by analyzing the system of equations. We also show the convergence of the value function in the exploratory formulation towards the value function in the classical formulation as the temperature parameter vanishes. Finally, a reinforcement learning algorithm is devised and implemented by invoking the policy evaluation based on the martingale characterization. Our numerical examples with the aid of neural networks illustrate the effectiveness of the proposed RL algorithm.
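In entropy-regularized exploratory formulations like this one, the randomized switching rule typically takes a Gibbs (softmax) form over candidate regimes. A minimal sketch, assuming toy value estimates v, switching costs c, and temperature lam (our notation, not the paper's):
```python
import numpy as np

# Illustrative Gibbs-type switching rule under entropy regularization.
# Given current regime i, per-regime value estimates v[j], switching costs
# c[i, j], and temperature lam, switch probabilities take a softmax form.
def gibbs_switch_distribution(i, v, c, lam):
    gains = v - c[i] - v[i]      # net benefit of switching from i to each j
    gains[i] = -np.inf           # exclude staying in regime i from the draw
    logits = gains / lam
    logits -= logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

v = np.array([1.0, 1.4, 0.9])    # toy value estimates for three regimes
c = 0.2 * (1 - np.eye(3))        # constant off-diagonal switching cost
print(gibbs_switch_distribution(0, v, c, lam=0.5))
```
As lam decreases, the distribution concentrates on the classically best switch, mirroring the vanishing-temperature convergence result stated in the abstract.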
Related papers
- ODELoRA: Training Low-Rank Adaptation by Solving Ordinary Differential Equations [54.886931928255564]
Low-rank adaptation (LoRA) has emerged as a widely adopted parameter-efficient fine-tuning method in deep transfer learning. We propose a novel continuous-time optimization dynamic for LoRA factor matrices in the form of an ordinary differential equation (ODE). We show that ODELoRA achieves stable feature learning, a property that is crucial for training deep neural networks at different scales of problem dimensionality.
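A minimal sketch of the continuous-time view, assuming a plain gradient-flow ODE dA/dt = -dL/dA, dB/dt = -dL/dB discretized by forward Euler on a toy least-squares loss; the paper's actual dynamics and solver may differ:
```python
import numpy as np

# LoRA re-parameterizes a weight update as B @ A with low-rank factors.
# Below: forward-Euler integration of the gradient flow on L = 0.5||BA - W||^2.
rng = np.random.default_rng(0)
d, r = 8, 2
W_target = rng.normal(size=(d, d))            # toy target update
A = rng.normal(size=(r, d)) * 0.1
B = np.zeros((d, r))

dt = 0.05                                      # Euler step size
for _ in range(2000):
    resid = B @ A - W_target                   # dL/d(BA)
    dA, dB = B.T @ resid, resid @ A.T          # chain rule through the factors
    A, B = A - dt * dA, B - dt * dB            # one Euler step of the flow

print(np.linalg.norm(B @ A - W_target))        # residual after integration
```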
arXiv Detail & Related papers (2026-02-07T10:19:36Z) - Continuous Policy and Value Iteration for Stochastic Control Problems and Its Convergence [8.65436459753278]
We introduce a continuous policy iteration algorithm where the approximations of the value function of a control problem and the optimal control are simultaneously updated through Langevin-type dynamics.
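A minimal sketch of the idea on a toy coupled objective standing in for a Bellman-type residual; both the value parameter and the control parameter receive simultaneous Langevin-type (noisy gradient) updates. The loss and constants below are ours, not the paper's:
```python
import numpy as np

# Simultaneous Langevin-type updates of a value parameter theta and a
# control parameter phi, instead of alternating evaluation/improvement.
rng = np.random.default_rng(1)

def loss(theta, phi):
    return (theta - phi**2) ** 2 + 0.1 * phi**2   # toy coupled objective

def grads(f, x, y, eps=1e-5):
    return ((f(x + eps, y) - f(x - eps, y)) / (2 * eps),
            (f(x, y + eps) - f(x, y - eps)) / (2 * eps))

theta, phi, eta, beta = 1.0, 1.0, 1e-2, 100.0     # beta = inverse temperature
noise = np.sqrt(2 * eta / beta)
for _ in range(5000):
    g_th, g_ph = grads(loss, theta, phi)
    theta = theta - eta * g_th + noise * rng.normal()  # Langevin step, theta
    phi = phi - eta * g_ph + noise * rng.normal()      # simultaneous step, phi
print(theta, phi)    # both hover near the minimizer (0, 0)
```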
arXiv Detail & Related papers (2025-06-09T18:20:21Z) - A Non-Asymptotic Theory of Seminorm Lyapunov Stability: From Deterministic to Stochastic Iterative Algorithms [15.764613607477887]
We study the problem of solving fixed-point equations for seminorm-contractive operators. We establish the non-asymptotic behavior of iterative algorithms in both deterministic and stochastic settings.
arXiv Detail & Related papers (2025-02-20T02:39:37Z) - Fast Value Tracking for Deep Reinforcement Learning [7.648784748888187]
Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment.
Existing algorithms often view these problems as static, focusing on point estimates for model parameters to maximize expected rewards.
Our research leverages the Kalman paradigm to introduce a novel quantification and sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD).
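A schematic of the flavor of such updates, not the paper's exact LKTD recursion: a linear TD step with injected Gaussian noise, whose iterates behave like samples quantifying parameter uncertainty rather than a single point estimate:
```python
import numpy as np

# Noisy (Langevin-style) TD updates of linear value weights; collecting the
# iterates yields posterior-style uncertainty estimates for the weights.
rng = np.random.default_rng(2)
w = np.zeros(4)                       # linear value weights
eta, sigma = 0.05, 0.05               # step size, injected-noise scale
samples = []

for t in range(3000):
    s = rng.normal(size=4)            # toy feature vector phi(s)
    s2 = rng.normal(size=4)           # next-state features phi(s')
    r = s @ np.array([1.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal()
    td = r + 0.9 * (s2 @ w) - s @ w   # TD error
    w += eta * td * s + sigma * np.sqrt(eta) * rng.normal(size=4)
    if t > 1000:
        samples.append(w.copy())      # samples after burn-in

print(np.mean(samples, axis=0), np.std(samples, axis=0))
```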
arXiv Detail & Related papers (2024-03-19T22:18:19Z) - Causal Temporal Regime Structure Learning [49.77103348208835]
We present CASTOR, a novel method that concurrently learns the Directed Acyclic Graph (DAG) for each regime. We establish the identifiability of the regimes and DAGs within our framework. Experiments show that CASTOR consistently outperforms existing causal discovery models.
arXiv Detail & Related papers (2023-11-02T17:26:49Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
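The quantile-regression ingredient can be sketched with the standard pinball loss on toy return samples; EQR applies this to values under model (epistemic) uncertainty, which is not reproduced here:
```python
import numpy as np

# Fit K quantiles of a return distribution by full-batch subgradient descent
# on the pinball (quantile) loss.
rng = np.random.default_rng(3)
returns = rng.normal(loc=1.0, scale=2.0, size=5000)   # toy return samples
taus = (np.arange(8) + 0.5) / 8                       # quantile levels
q = np.zeros(8)                                       # quantile estimates

lr = 0.1
for _ in range(1000):
    for k, tau in enumerate(taus):
        u = returns - q[k]                            # residuals
        # subgradient of mean pinball loss: tau on u > 0, tau - 1 on u < 0
        g = -(tau * (u > 0).mean() - (1 - tau) * (u < 0).mean())
        q[k] -= lr * g
print(q)   # approximates the N(1, 2) quantiles at levels taus
```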
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning [9.779769486156631]
Herein the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
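In entropy-regularised linear-quadratic RL, the optimal relaxed policy is known to be Gaussian, with the usual LQ feedback as mean and a variance that grows with the regularisation strength; the schedule of that strength is what the paper studies. A minimal sketch with placeholder constants (the exact variance formula depends on the normalisation used):
```python
import numpy as np

# Sample noisy controls from a Gaussian relaxed policy whose variance is
# set by a decaying entropy-regularisation schedule (assumed, illustrative).
rng = np.random.default_rng(4)

def relaxed_control(x, K, lam, R):
    """Mean is the LQ feedback -K x; variance scales like lam / (2 R)."""
    return rng.normal(loc=-K * x, scale=np.sqrt(lam / (2 * R)))

x, K, R = 1.0, 0.8, 1.0
for episode in range(3):
    lam = 1.0 / (episode + 1)        # decaying exploration schedule (assumed)
    print(episode, relaxed_control(x, K, lam, R))
```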
arXiv Detail & Related papers (2022-08-08T23:36:40Z) - Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning [39.07307690074323]
We consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time environment.
Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, state representations, multiagent coordination, and more.
We propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm.
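A single JKO (minimizing-movement) step can be sketched in one dimension, where the 2-Wasserstein distance between distributions equals the L2 distance between their quantile functions; the distributional-HJB setting adds further terms not shown here:
```python
import numpy as np

# One JKO step for a potential energy F(rho) = E[V(X)] with V(x) = x^2/2,
# stored in quantile coordinates. Per quantile, the step solves
#   q_new = argmin_q (q - q_old)^2 / (2 * tau_step) + V(q),
# whose first-order condition gives the closed form q_new = q_old / (1 + tau_step).
taus = np.linspace(0.05, 0.95, 19)
q = np.quantile(np.random.default_rng(5).normal(2.0, 1.0, 10000), taus)

tau_step = 0.5
for _ in range(10):
    q = q / (1 + tau_step)            # closed-form implicit (JKO) update

print(q)   # quantiles flow toward the minimizer of V at x = 0
```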
arXiv Detail & Related papers (2022-05-24T16:33:54Z) - Robust Value Iteration for Continuous Control Tasks [99.00362538261972]
When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well.
We present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain.
We show that robust value iteration is more robust compared to deep reinforcement learning algorithms and the non-robust version of the algorithm.
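The robust backup can be sketched on a toy grid: the next-state value is taken as the worst case over a set of admissible model perturbations, so the resulting policy hedges against dynamics mismatch. Grid, dynamics, and perturbation set below are stand-ins:
```python
import numpy as np

# Robust value iteration on a 1-D grid: max over actions of the worst-case
# (min over perturbations) next-state value.
n, gamma = 21, 0.95
xs = np.linspace(-1, 1, n)
V = np.zeros(n)
actions = [-1, 0, 1]
perturb = [-1, 0, 1]                    # adversarial index shifts (model error)

def step(i, a, d):
    return np.clip(i + a + d, 0, n - 1) # perturbed deterministic transition

for _ in range(200):
    V_new = np.empty(n)
    for i in range(n):
        reward = -xs[i] ** 2            # drive the state toward the origin
        V_new[i] = reward + gamma * max(
            min(V[step(i, a, d)] for d in perturb) for a in actions)
    V = V_new

print(V[n // 2], V[0])                  # origin is worth more than the edge
```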
arXiv Detail & Related papers (2021-05-25T19:48:35Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a generalized policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range of learning rates, in a dimension-free fashion, to the global solution.
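With the KL mirror map, one policy-mirror-descent step has the closed form pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)); GPMD generalizes this to broader regularizers and mirror maps. A tabular sketch with a toy Q:
```python
import numpy as np

# One KL-mirror-descent policy update per state, computed in log space.
def pmd_step(pi, Q, eta):
    logits = np.log(pi) + eta * Q          # mirror-descent step in dual space
    logits -= logits.max(axis=1, keepdims=True)
    new = np.exp(logits)
    return new / new.sum(axis=1, keepdims=True)

pi = np.full((2, 3), 1 / 3)                # 2 states, 3 actions, uniform init
Q = np.array([[1.0, 0.0, -1.0],
              [0.0, 2.0, 0.0]])
for _ in range(50):
    pi = pmd_step(pi, Q, eta=0.5)
print(pi.round(3))                         # mass concentrates on argmax_a Q
```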
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
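A hedged sketch of the convexity point, not the paper's exact objective: a log-sum-exp ("logistic") functional of TD errors is convex in the value weights because each TD error is affine in them, in contrast to many squared-Bellman variants with function approximation:
```python
import numpy as np

# A convex log-mean-exp surrogate over TD errors, with a numerical
# convexity check at a midpoint.
rng = np.random.default_rng(6)
phi, phi_next = rng.normal(size=(100, 4)), rng.normal(size=(100, 4))
r, gamma, eta = rng.normal(size=100), 0.9, 1.0

def logistic_loss(w):
    delta = r + gamma * phi_next @ w - phi @ w      # TD errors, affine in w
    m = (eta * delta).max()                         # stabilized log-mean-exp
    return (m + np.log(np.mean(np.exp(eta * delta - m)))) / eta

w0, w1 = np.zeros(4), np.ones(4)
mid = logistic_loss(0.5 * (w0 + w1))
print(mid <= 0.5 * (logistic_loss(w0) + logistic_loss(w1)))  # convexity check
```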
arXiv Detail & Related papers (2020-10-21T17:14:31Z) - A Distributional Analysis of Sampling-Based Reinforcement Learning
Algorithms [67.67377846416106]
We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes.
We show that value-based methods such as TD($\lambda$) and $Q$-learning have update rules which are contractive in the space of distributions of functions.
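A minimal tabular TD($\lambda$) with a constant step size, the regime this analysis studies: the iterates then converge to a stationary distribution around the true values rather than to a point. The chain MRP is a toy example:
```python
import numpy as np

# Tabular TD(lambda) with accumulating eligibility traces on a small chain
# MRP that terminates at the rightmost state with reward 1.
rng = np.random.default_rng(7)
n, gamma, lam, alpha = 5, 0.9, 0.8, 0.1
V = np.zeros(n)

for episode in range(2000):
    s, e = 0, np.zeros(n)                      # start state, eligibility traces
    while s < n - 1:
        s2 = min(s + rng.integers(1, 3), n - 1)
        r = 1.0 if s2 == n - 1 else 0.0        # reward on reaching the goal
        delta = r + gamma * V[s2] * (s2 < n - 1) - V[s]
        e *= gamma * lam                       # decay all traces
        e[s] += 1.0                            # accumulate trace at s
        V += alpha * delta * e                 # TD(lambda) update
        s = s2
print(V.round(3))
```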
arXiv Detail & Related papers (2020-03-27T05:13:29Z) - Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives [97.16266088683061]
The article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms.
It provides a characterization of algorithms that exhibit accelerated convergence.
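The role of symplectic discretization can be seen already on a quadratic: the semi-implicit ("symplectic") Euler scheme updates the velocity first and then the position with the new velocity, and stays stable at step sizes where the explicit variant diverges:
```python
# Compare semi-implicit (symplectic) and explicit Euler for momentum
# dynamics on f(x) = x^2 / 2; at h = 0.9 only the symplectic scheme is stable.
def run(symplectic, h=0.9, steps=200, mu=0.9):
    x, v = 5.0, 0.0
    grad = lambda x: x                  # gradient of f(x) = x^2 / 2
    for _ in range(steps):
        if symplectic:
            v = mu * v - h * grad(x)    # update velocity first
            x = x + v                   # then position with the *new* velocity
        else:
            x_old = x
            x = x + v                   # position with the *old* velocity
            v = mu * v - h * grad(x_old)
    return x

print(run(symplectic=True), run(symplectic=False))  # small vs. diverged
```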
arXiv Detail & Related papers (2020-02-28T00:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.