Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation
- URL: http://arxiv.org/abs/2505.14821v1
- Date: Tue, 20 May 2025 18:37:51 GMT
- Title: Sample and Computationally Efficient Continuous-Time Reinforcement Learning with General Function Approximation
- Authors: Runze Zhao, Yue Yu, Adams Yiyue Zhu, Chen Yang, Dongruo Zhou
- Abstract summary: Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. We propose a model-based algorithm that achieves both sample and computational efficiency. We show that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements.
- Score: 28.63391989014238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continuous-time reinforcement learning (CTRL) provides a principled framework for sequential decision-making in environments where interactions evolve continuously over time. Despite its empirical success, the theoretical understanding of CTRL remains limited, especially in settings with general function approximation. In this work, we propose a model-based CTRL algorithm that achieves both sample and computational efficiency. Our approach leverages optimism-based confidence sets to establish the first sample complexity guarantee for CTRL with general function approximation, showing that a near-optimal policy can be learned with a suboptimality gap of $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ using $N$ measurements, where $d_{\mathcal{R}}$ and $d_{\mathcal{F}}$ denote the distributional Eluder dimensions of the reward and dynamics functions, respectively, capturing the complexity of general function approximation in reinforcement learning. Moreover, we introduce structured policy updates and an alternative measurement strategy that significantly reduce the number of policy updates and rollouts while maintaining competitive sample efficiency. We conduct experiments to support our proposed algorithms on continuous control tasks and diffusion model fine-tuning, demonstrating comparable performance with significantly fewer policy updates and rollouts.
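To make the optimism-via-confidence-sets idea in the abstract concrete, the sketch below maintains a set of candidate dynamics models consistent with the measurements collected so far and acts greedily with respect to the most optimistic member of that set. This is only an illustrative toy, not the paper's algorithm: the finite candidate class, the slack parameter `beta`, the Euler step, and the quadratic reward are all invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D system: the true drift is unknown to the learner; the candidate
# model class is a small finite set of drift functions (invented for illustration).
true_drift = lambda x, a: -0.5 * x + a
candidate_drifts = [lambda x, a, k=k: k * x + a for k in (-1.0, -0.5, 0.0, 0.5)]

def squared_loss(drift, data):
    """Cumulative squared prediction error of a candidate drift on observed measurements."""
    return sum((drift(x, a) - y) ** 2 for x, a, y in data)

def confidence_set(data, beta):
    """All candidates whose loss is within slack beta of the empirical minimizer."""
    losses = [squared_loss(f, data) for f in candidate_drifts]
    best = min(losses)
    return [f for f, loss in zip(candidate_drifts, losses) if loss <= best + beta]

def optimistic_action(x, models, dt=0.1, actions=np.linspace(-1.0, 1.0, 21)):
    """Choose the action whose best-case (over the confidence set) one-step reward is largest."""
    reward = lambda x_next: -x_next ** 2          # toy reward: drive the state to 0
    best_a, best_val = 0.0, -np.inf
    for a in actions:
        val = max(reward(x + dt * f(x, a)) for f in models)   # optimism over the model set
        if val > best_val:
            best_a, best_val = a, val
    return best_a

data, x = [], 1.0
for t in range(200):
    models = confidence_set(data, beta=1.0)        # confidence set from all measurements so far
    a = optimistic_action(x, models)
    y = true_drift(x, a) + 0.05 * rng.standard_normal()   # noisy drift measurement
    data.append((x, a, y))
    x = x + 0.1 * y                                # Euler step of the observed dynamics
print(f"state after 200 optimistic steps: {x:.3f}")
```

In the paper's setting the model class is a general function class whose complexity is measured by the distributional Eluder dimension, and the confidence sets are chosen so that the suboptimality gap scales as $\tilde{O}(\sqrt{d_{\mathcal{R}} + d_{\mathcal{F}}}N^{-1/2})$ in the number of measurements $N$; the toy above only mimics the act-optimistically-then-refit loop.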
Related papers
- COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z) - Instance-Dependent Continuous-Time Reinforcement Learning via Maximum Likelihood Estimation [27.232790785138427]
Continuous-time reinforcement learning (CTRL) provides a natural framework for sequential decision-making in dynamic environments. While CTRL has shown growing empirical success, its ability to adapt to varying levels of problem difficulty remains poorly understood. In this work, we investigate the instance-dependent behavior of CTRL and introduce a simple, model-based algorithm built on maximum likelihood estimation.
arXiv Detail & Related papers (2025-08-04T06:25:45Z) - A Two-Timescale Primal-Dual Framework for Reinforcement Learning via Online Dual Variable Guidance [3.4354636842203026]
We propose PGDA-RL, a primal-dual Projected Gradient Descent-Ascent algorithm for solving regularized Markov Decision Processes (MDPs). PGDA-RL integrates experience replay-based gradient estimation with a two-timescale decomposition of the underlying nested optimization problem. We prove that PGDA-RL converges almost surely to the optimal value function and policy of the regularized MDP.
arXiv Detail & Related papers (2025-05-07T15:18:43Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, rather than optimizing over the entire set of $n$ actions, only consider a variable subset of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection (a minimal sketch of the subset-maximization idea appears after this list).
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Natural Actor-Critic for Robust Reinforcement Learning with Function
Approximation [20.43657369407846]
We study robust reinforcement learning (RL) with the goal of determining a well-performing policy that is robust against model mismatch between the training simulator and the testing environment.
We propose two novel uncertainty set formulations, one based on double sampling and the other on an integral probability metric.
We demonstrate the robust performance of the policy learned by our proposed RNAC approach in multiple MuJoCo environments and a real-world TurtleBot navigation task.
arXiv Detail & Related papers (2023-07-17T22:10:20Z) - Efficient Model-Free Exploration in Low-Rank MDPs [76.87340323826945]
Low-Rank Markov Decision Processes offer a simple, yet expressive framework for RL with function approximation.
Existing algorithms are either (1) computationally intractable, or (2) reliant upon restrictive statistical assumptions.
We propose the first provably sample-efficient algorithm for exploration in Low-Rank MDPs.
arXiv Detail & Related papers (2023-07-08T15:41:48Z) - ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages [37.12048108122337]
This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm.
arXiv Detail & Related papers (2023-06-02T11:37:22Z) - Non-stationary Reinforcement Learning under General Function
Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called \texttt{MEX}.
\texttt{MEX} integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient
Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - Provably Correct Optimization and Exploration with Non-linear Policies [65.60853260886516]
ENIAC is an actor-critic method that allows non-linear function approximation in the critic.
We show that under certain assumptions, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds.
We empirically evaluate this adaptation and show that it outperforms priors inspired by linear methods.
arXiv Detail & Related papers (2021-03-22T03:16:33Z) - Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z) - Reinforcement Learning with General Value Function Approximation:
Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
We establish a provably efficient reinforcement learning algorithm with general value function approximation.
We show that our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure.
Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment.
arXiv Detail & Related papers (2020-05-21T17:36:09Z) - A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)
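The Stochastic Q-learning entry above replaces the maximization over all $n$ actions with a maximization over a small random subset, possibly of size $\mathcal{O}(\log(n))$, in both the update target and action selection. The tabular sketch below, with an entirely invented toy environment and hyperparameters, illustrates that subset-maximization pattern; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy MDP with a large discrete action space.
n_states, n_actions = 10, 1024
subset_size = max(1, int(np.ceil(np.log2(n_actions))))   # ~O(log n) actions per max

reward = rng.normal(size=(n_states, n_actions))                   # fixed random rewards
transition = rng.integers(n_states, size=(n_states, n_actions))   # deterministic next states

Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.95, 0.1, 0.1

def stoch_max(q_row):
    """Max of Q over a random action subset instead of all n_actions."""
    idx = rng.choice(n_actions, size=subset_size, replace=False)
    return q_row[idx].max()

def stoch_greedy(q_row):
    """Greedy action restricted to a random action subset."""
    idx = rng.choice(n_actions, size=subset_size, replace=False)
    return idx[np.argmax(q_row[idx])]

s = 0
for step in range(50_000):
    a = rng.integers(n_actions) if rng.random() < eps else stoch_greedy(Q[s])
    r, s_next = reward[s, a], transition[s, a]
    # Standard Q-learning update, except the target max is over a random subset.
    Q[s, a] += alpha * (r + gamma * stoch_max(Q[s_next]) - Q[s, a])
    s = s_next

print("mean learned Q-value:", round(float(Q.mean()), 3))
```

The only change relative to standard Q-learning is that `stoch_max` and `stoch_greedy` draw a fresh random subset of actions per call, which reduces the per-step cost of the maximization from $O(n)$ to roughly $O(\log n)$.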