Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning
- URL: http://arxiv.org/abs/2503.09981v2
- Date: Thu, 02 Oct 2025 14:00:25 GMT
- Title: Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning
- Authors: Yanwei Jia, Du Ouyang, Yufei Zhang
- Abstract summary: We rigorously analyze a policy execution framework that samples actions from a policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the policy. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations.
- Score: 3.973277434105709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stochastic policies (also known as relaxed controls) are widely used in continuous-time reinforcement learning algorithms. However, executing a stochastic policy and evaluating its performance in a continuous-time environment remain open challenges. This work introduces and rigorously analyzes a policy execution framework that samples actions from a stochastic policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients aggregated according to the stochastic policy. We explicitly quantify the convergence rate based on the regularity of the coefficients and establish an optimal first-order convergence rate for sufficiently regular coefficients. Additionally, we prove a $1/2$-order weak convergence rate that holds uniformly over the sampling noise with high probability, and establish a $1/2$-order pathwise convergence for each realization of the system noise in the absence of volatility control. Building on these results, we analyze the bias and variance of various policy evaluation and policy gradient estimators based on discrete-time observations. Our results provide theoretical justification for the exploratory stochastic control framework in [H. Wang, T. Zariphopoulou, and X.Y. Zhou, J. Mach. Learn. Res., 21 (2020), pp. 1-34].
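For illustration, below is a minimal Euler-Maruyama sketch of the execution scheme the abstract describes: at each grid point an action is sampled from a stochastic policy and held as a piecewise constant control until the next grid point. The one-dimensional dynamics, the Gaussian policy, and the uniform grid are assumptions made for the example; they are not taken from the paper.

```python
import numpy as np

def drift(x, a):                 # example controlled drift b(x, a)
    return -x + a

def vol(x, a):                   # example controlled volatility sigma(x, a)
    return 0.5 + 0.1 * abs(a)

def sample_action(x, rng):       # example stochastic policy: a ~ N(-0.5 x, 0.2^2)
    return rng.normal(-0.5 * x, 0.2)

def simulate(x0=1.0, T=1.0, n_steps=200, seed=0):
    """One path of the sampled-policy execution on a uniform grid of mesh T/n_steps."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = x0
    for _ in range(n_steps):
        a = sample_action(x, rng)                    # resample the action at the grid point
        dw = rng.normal(0.0, np.sqrt(dt))            # Brownian increment
        x += drift(x, a) * dt + vol(x, a) * dw       # hold a fixed over [t_k, t_{k+1})
    return x

# As the mesh size shrinks, the law of this state process converges weakly to the
# dynamics with coefficients aggregated under the policy (the paper's result).
print(simulate(n_steps=1000))
```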
Related papers
- Active Bipartite Ranking with Smooth Posterior Distributions [1.9838140219494644]
Bipartite ranking is a statistical learning problem involved in many applications and widely studied in the passive context. We propose a novel algorithm, referred to as smooth-rank and designed for the continuous setting, which aims to minimise the distance between the ROC curve of the estimated ranking rule and the optimal one w.r.t. the $\sup$ norm. We provide a problem-dependent upper bound on the expected sampling time of smooth-rank and establish a problem-dependent lower bound on the expected sampling time of any PAC algorithm.
arXiv Detail & Related papers (2026-02-27T18:32:08Z) - MDP Planning as Policy Inference [1.7259824817932294]
We cast episodic Markov decision process (MDP) planning as Bayesian inference over _policies_. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic.
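As a rough illustration of the policy-as-latent-variable idea stated above, the sketch below assigns each policy in a small family an unnormalized optimality weight that is monotone in its estimated expected return; the toy one-step environment, the Gaussian policy family, and the exponential weight are assumptions for the example, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_return(theta, n_rollouts=500):
    """Toy one-step task: action a ~ N(theta, 1), reward r = -a**2 (best at theta = 0)."""
    actions = rng.normal(theta, 1.0, size=n_rollouts)
    return np.mean(-actions ** 2)

thetas = np.linspace(-2.0, 2.0, 9)            # a small, discrete family of policies
returns = np.array([expected_return(t) for t in thetas])

eta = 2.0                                      # temperature of the optimality likelihood
log_w = eta * returns                          # unnormalized log-probability of optimality
posterior = np.exp(log_w - log_w.max())
posterior /= posterior.sum()                   # posterior over policies (uniform prior)

for t, p in zip(thetas, posterior):
    print(f"theta = {t:+.2f}   posterior mass = {p:.3f}")
```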
arXiv Detail & Related papers (2026-02-19T13:56:31Z) - Learning Deterministic Policies with Policy Gradients in Constrained Markov Decision Processes [59.27926064817273]
We introduce an exploration-agnostic algorithm, called C-PG, which enjoys global last-iterate convergence guarantees under domination assumptions. We empirically validate both the action-based (C-PGAE) and parameter-based (C-PGPE) variants of C-PG on constrained control tasks.
arXiv Detail & Related papers (2025-06-06T10:29:05Z) - Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates [61.091122503406304]
We show that the gradient bandit algorithm converges to a globally optimal policy almost surely using any constant learning rate. This result demonstrates that the gradient bandit algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down.
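For reference, the gradient bandit algorithm in question is the standard softmax preference-update scheme (as in Sutton & Barto, Sec. 2.8); the sketch below runs it with a constant learning rate on an illustrative 3-armed instance, which is an assumption for the example rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])   # true mean rewards of the arms (illustrative)
H = np.zeros(3)                      # action preferences
alpha = 0.5                          # constant learning rate (need not be small)
baseline, t = 0.0, 0

for _ in range(5000):
    pi = np.exp(H - H.max()); pi /= pi.sum()        # softmax policy over arms
    a = rng.choice(3, p=pi)
    r = rng.normal(means[a], 1.0)
    t += 1
    baseline += (r - baseline) / t                   # running average reward as baseline
    grad = -pi * (r - baseline)                      # update for non-chosen arms
    grad[a] += (r - baseline)                        # chosen arm gets (1 - pi[a]) * (r - baseline)
    H += alpha * grad

pi = np.exp(H - H.max()); pi /= pi.sum()
print("final policy:", np.round(pi, 3))              # mass should concentrate on arm 2
```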
arXiv Detail & Related papers (2025-02-11T00:12:04Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z) - $K$-Nearest-Neighbor Resampling for Off-Policy Evaluation in Stochastic Control [0.6906005491572401]
We propose a novel $K$-nearest-neighbor resampling procedure for estimating the performance of a policy from historical data.
Our analysis allows for the sampling of entire episodes, as is common practice in most applications.
Compared to other OPE methods, our algorithm does not require optimization, can be efficiently implemented via tree-based nearest neighbor search and parallelization, and does not explicitly assume a parametric model for the environment's dynamics.
arXiv Detail & Related papers (2023-06-07T23:55:12Z) - Actor-Critic learning for mean-field control in continuous time [0.0]
We study policy gradient for mean-field control in continuous time in a reinforcement learning setting.
By considering randomised policies with entropy regularisation, we derive a gradient expectation representation of the value function.
In the linear-quadratic mean-field framework, we obtain an exact parametrisation of the actor and critic functions defined on the Wasserstein space.
arXiv Detail & Related papers (2023-03-13T10:49:25Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate. We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Bounded Robustness in Reinforcement Learning via Lexicographic Objectives [54.00072722686121]
Policy robustness in Reinforcement Learning may not be desirable at any cost.
We study how policies can be maximally robust to arbitrary observational noise.
We propose a robustness-inducing scheme, applicable to any policy algorithm, that trades off expected policy utility for robustness.
arXiv Detail & Related papers (2022-09-30T08:53:18Z) - Optimal scheduling of entropy regulariser for continuous-time linear-quadratic reinforcement learning [9.779769486156631]
Here the agent interacts with the environment by generating noisy controls distributed according to the optimal relaxed policy.
This exploration-exploitation trade-off is determined by the strength of entropy regularisation.
We prove that the regret, for both learning algorithms, is of the order $\mathcal{O}(\sqrt{N})$ (up to a logarithmic factor) over $N$ episodes, matching the best known result from the literature.
arXiv Detail & Related papers (2022-08-08T23:36:40Z) - On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is the establishment that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)