Provably Efficient Model-Free Algorithms for Non-stationary CMDPs
- URL: http://arxiv.org/abs/2303.05733v1
- Date: Fri, 10 Mar 2023 06:33:38 GMT
- Title: Provably Efficient Model-Free Algorithms for Non-stationary CMDPs
- Authors: Honghao Wei, Arnob Ghosh, Ness Shroff, Lei Ying, Xingyu Zhou
- Abstract summary: We study model-free reinforcement learning algorithms in episodic non-stationary constrained Markov Decision Processes.
In the non-stationary environment, reward, utility functions, and transition kernels can vary arbitrarily over time as long as the cumulative variations do not exceed certain variation budgets.
We propose the first model-free, simulator-free RL algorithms with sublinear regret and zero constraint violation for non-stationary CMDPs.
- Score: 10.930095238723327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study model-free reinforcement learning (RL) algorithms in episodic
non-stationary constrained Markov Decision Processes (CMDPs), in which an agent
aims to maximize the expected cumulative reward subject to a cumulative
constraint on the expected utility (cost). In the non-stationary environment,
reward, utility functions, and transition kernels can vary arbitrarily over
time as long as the cumulative variations do not exceed certain variation
budgets. We propose the first model-free, simulator-free RL algorithms with
sublinear regret and zero constraint violation for non-stationary CMDPs in both
tabular and linear function approximation settings with provable performance
guarantees. Our results on regret bound and constraint violation for the
tabular case match the corresponding best results for stationary CMDPs when the
total budget is known. Additionally, we present a general framework for
addressing the well-known challenges associated with analyzing non-stationary
CMDPs, without requiring prior knowledge of the variation budget. We apply the
approach to both tabular and linear function approximation settings.
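To make the setting concrete, a minimal formalization of the episodic non-stationary CMDP objective and its variation budgets is sketched below; the notation (number of episodes $K$, horizon $H$, reward $r_{k,h}$, utility $g_{k,h}$, transition kernel $P_{k,h}$, constraint threshold $\rho$, and budgets $B_r$, $B_g$, $B_P$) is illustrative and not taken verbatim from the paper.
$$
\max_{\pi_k}\ \mathbb{E}_{\pi_k,P_k}\Big[\sum_{h=1}^{H} r_{k,h}(s_h,a_h)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi_k,P_k}\Big[\sum_{h=1}^{H} g_{k,h}(s_h,a_h)\Big] \ge \rho,
$$
where the drift of the environment over the $K$ episodes is bounded by
$$
\sum_{k=1}^{K-1}\max_{h,s,a}\big|r_{k+1,h}(s,a)-r_{k,h}(s,a)\big| \le B_r, \qquad
\sum_{k=1}^{K-1}\max_{h,s,a}\big\|P_{k+1,h}(\cdot\mid s,a)-P_{k,h}(\cdot\mid s,a)\big\|_1 \le B_P,
$$
with an analogous budget $B_g$ for the utility functions. In this setting, regret is typically measured against the per-episode optimal constrained policies (dynamic regret), and "zero constraint violation" means the cumulative violation summed over the $K$ episodes is non-positive.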
Related papers
- Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation [24.299769025346368]
We study the reinforcement learning problem in a constrained Markov decision process (CMDP).
We propose an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee.
Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.
arXiv Detail & Related papers (2025-02-14T13:07:25Z)
- Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference [33.14076284663493]
We study efficient inference on linear functionals of the $Q$-function in time-invariant Markov Decision Processes (MDPs).
Semiparametric restrictions on the MDP can reduce the overlap requirement and lower the efficiency bound, yielding more precise estimates.
As a special case, we propose a novel adaptive debiased plug-in estimator that uses isotonic-adaptive fitted $Q$-iteration - a new calibration algorithm for MDPs.
arXiv Detail & Related papers (2025-01-12T20:35:28Z)
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
- Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
arXiv Detail & Related papers (2023-08-10T09:52:44Z)
- Non-stationary Reinforcement Learning under General Function Approximation [60.430936031067006]
We first propose a new complexity metric called dynamic Bellman Eluder (DBE) dimension for non-stationary MDPs.
Based on the proposed complexity metric, we propose a novel confidence-set based model-free algorithm called SW-OPEA.
We show that SW-OPEA is provably efficient as long as the variation budget is not significantly large.
arXiv Detail & Related papers (2023-06-01T16:19:37Z)
- Provably Efficient Primal-Dual Reinforcement Learning for CMDPs with Non-stationary Objectives and Constraints [8.840221198764482]
We consider primal-dual-based reinforcement learning (RL) in episodic constrained Markov decision processes (CMDPs) with non-stationary objectives and constraints.
We propose a Periodically Restarted Optimistic Primal-Dual Proximal Policy Optimization (PROPD-PPO) algorithm that features three mechanisms: periodic-restart-based policy improvement, dual update with dual regularization, and periodic-restart-based optimistic policy evaluation.
arXiv Detail & Related papers (2022-01-28T07:18:29Z)
- Solving Multistage Stochastic Linear Programming via Regularized Linear Decision Rules: An Application to Hydrothermal Dispatch Planning [77.34726150561087]
We propose a novel regularization scheme for linear decision rules (LDR) based on the AdaLASSO (adaptive least absolute shrinkage and selection operator).
Experiments show that the risk of overfitting is non-negligible when the classical non-regularized LDR is used to solve MSLP.
For the LHDP problem, our analysis highlights several benefits of the proposed framework in comparison to the non-regularized benchmark.
arXiv Detail & Related papers (2021-10-07T02:36:14Z)
- A Lyapunov-Based Methodology for Constrained Optimization with Bandit Feedback [22.17503016665027]
We consider problems where each action returns a random reward, cost, and penalty from an unknown joint distribution.
We propose a novel low-complexity algorithm, named $\texttt{LyOn}$, and prove that it achieves $O(\sqrt{B\log B})$ regret and $O(\log B/B)$ constraint-violation.
The low computational cost and the performance bounds of $\texttt{LyOn}$ suggest that the Lyapunov-based algorithm design methodology can be effective in solving constrained bandit optimization problems.
arXiv Detail & Related papers (2021-06-09T16:12:07Z)
- Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z)
- Nonstationary Reinforcement Learning with Linear Function Approximation [19.521419943509784]
We consider reinforcement learning in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment.
We first develop an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when variation budgets are known.
We derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and, as a byproduct, establish a minimax regret lower bound for linear MDPs that was left open by Jin et al.
arXiv Detail & Related papers (2020-10-08T20:07:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.