Provably Efficient Model-Free Constrained RL with Linear Function
Approximation
- URL: http://arxiv.org/abs/2206.11889v1
- Date: Thu, 23 Jun 2022 17:54:31 GMT
- Title: Provably Efficient Model-Free Constrained RL with Linear Function
Approximation
- Authors: Arnob Ghosh, Xingyu Zhou, Ness Shroff
- Abstract summary: We develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems.
Our results are achieved via novel adaptations of the standard LSVI-UCB algorithms.
- Score: 4.060731229044571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the constrained reinforcement learning problem, in which an agent
aims to maximize the expected cumulative reward subject to a constraint on the
expected total value of a utility function. In contrast to existing model-based
approaches or model-free methods accompanied with a `simulator', we aim to
develop the first model-free, simulator-free algorithm that achieves a
sublinear regret and a sublinear constraint violation even in large-scale
systems. To this end, we consider the episodic constrained Markov decision
processes with linear function approximation, where the transition dynamics and
the reward function can be represented as a linear function of some known
feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and
$\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be
achieved, where $d$ is the dimension of the feature mapping, $H$ is the length
of the episode, and $T$ is the total number of steps. Our bounds are attained
without explicitly estimating the unknown transition model or requiring a
simulator, and they depend on the state space only through the dimension of the
feature mapping. Hence our bounds hold even when the number of states goes to
infinity. Our main results are achieved via novel adaptations of the standard
LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization
into the LSVI-UCB algorithm to balance between regret and constraint violation.
More importantly, we replace the standard greedy selection with respect to the
state-action function in LSVI-UCB with a soft-max policy. This turns out to be
key in establishing uniform concentration for the constrained case via its
approximation-smoothness trade-off. We also show that one can achieve zero
constraint violation while still maintaining the same order with respect to $T$.
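The abstract compresses the setup into prose; the LaTeX snippet below writes out, under standard episodic CMDP conventions, the objective, the linear MDP assumption, and the two guarantees quoted above. The symbols $r$, $g$, $b$, $x_1$, $\pi^k$, $\pi^\star$, $\mu_h$, $\theta_h$ are introduced here only for illustration (they do not appear verbatim in the abstract), and the violation is written as a plain sum; some papers instead use a positive-part convention.

```latex
% Episodic CMDP: reward r, utility g, horizon H, constraint threshold b.
\max_{\pi}\; V^{\pi}_{r,1}(x_1)
\quad \text{s.t.} \quad V^{\pi}_{g,1}(x_1) \ge b,
\qquad
V^{\pi}_{\diamond,1}(x_1) := \mathbb{E}_{\pi}\Big[\textstyle\sum_{h=1}^{H} \diamond_h(x_h,a_h)\Big],
\quad \diamond \in \{r,g\}.

% Linear MDP assumption: for a known feature map \phi(x,a) \in \mathbb{R}^d,
% transitions and rewards are linear in \phi:
\mathbb{P}_h(\cdot \mid x,a) = \langle \phi(x,a), \mu_h(\cdot) \rangle,
\qquad
r_h(x,a) = \langle \phi(x,a), \theta_h \rangle.

% With K = T/H episodes, executed policies \pi^1,\dots,\pi^K, and \pi^\star the
% best policy satisfying the constraint, the stated guarantees read
\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{\pi^\star}_{r,1}(x_1) - V^{\pi^k}_{r,1}(x_1) \Big)
  = \tilde{\mathcal{O}}\big(\sqrt{d^3H^3T}\big),
\qquad
\mathrm{Violation}(K) = \sum_{k=1}^{K} \Big( b - V^{\pi^k}_{g,1}(x_1) \Big)
  = \tilde{\mathcal{O}}\big(\sqrt{d^3H^3T}\big).
```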
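As a rough illustration of the two algorithmic ingredients highlighted in the abstract, the sketch below wraps an LSVI-UCB-style least-squares update in a primal-dual loop and acts with a soft-max rather than greedy policy over the optimistic Q-estimates. It is a minimal sketch under simplifying assumptions, not the paper's algorithm: the environment interface, the feature map phi, the step sizes, and the frozen regression targets are all hypothetical choices made here for brevity.

```python
import numpy as np


def softmax_policy(q_row, temperature):
    """Soft-max distribution over one state's optimistic Q-values."""
    z = (q_row - q_row.max()) / temperature
    p = np.exp(z)
    return p / p.sum()


def primal_dual_lsvi_ucb(env, phi, d, H, K, b, beta=1.0, temperature=0.1, eta=0.05):
    """Hypothetical interface: env.reset() -> state, env.step(a) -> (state, reward, utility),
    env.n_actions; phi(state, action) -> np.ndarray of shape (d,) (linear MDP features);
    b is the required expected total utility (constraint threshold)."""
    n_actions = env.n_actions
    dual = 0.0                                  # Lagrange multiplier for the constraint
    gram = [np.eye(d) for _ in range(H)]        # per-step regularized Gram matrices
    data = [[] for _ in range(H)]               # (feature, regression target) pairs

    for _ in range(K):
        # Per-step ridge regression on the dual-mixed signal r + dual * g.
        w = [np.zeros(d) for _ in range(H)]
        for h in range(H):
            if data[h]:
                X = np.array([f for f, _ in data[h]])
                y = np.array([t for _, t in data[h]])
                w[h] = np.linalg.solve(gram[h], X.T @ y)

        def q_value(h, s, a):
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(gram[h], f))  # UCB exploration bonus
            return float(f @ w[h] + bonus)

        # Forward pass: act with a soft-max policy over optimistic Q-values (not greedy).
        s = env.reset()
        ep_utility = 0.0
        for h in range(H):
            q_row = np.array([q_value(h, s, a) for a in range(n_actions)])
            a = int(np.random.choice(n_actions, p=softmax_policy(q_row, temperature)))
            s_next, r, g = env.step(a)
            ep_utility += g

            if h < H - 1:
                q_next = np.array([q_value(h + 1, s_next, ap) for ap in range(n_actions)])
                p_next = softmax_policy(q_next, temperature)
                v_next = float(p_next @ q_next)   # value of the soft-max policy at step h+1
            else:
                v_next = 0.0

            f = phi(s, a)
            gram[h] += np.outer(f, f)
            # Simplification: full LSVI-UCB recomputes regression targets every episode
            # from stored transitions; here they are frozen at collection time.
            data[h].append((f, r + dual * g + v_next))
            s = s_next

        # Dual update: raise the multiplier whenever the episode under-delivers utility.
        dual = max(0.0, dual + eta * (b - ep_utility))

    return w, dual
```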
Related papers
- Settling Constant Regrets in Linear Markov Decision Processes [57.34287648914407]
We study constant regret guarantees in reinforcement learning (RL).
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs).
For an MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}\dots$
arXiv Detail & Related papers (2024-04-16T17:23:19Z)
- Horizon-Free Regret for Linear Markov Decision Processes [92.02082223856479]
A recent line of works showed regret bounds in reinforcement learning can be (nearly) independent of planning horizon.
We give the first horizon-free bound for the popular linear Markov Decision Process (MDP) setting.
In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions, we directly estimate the value functions and confidence sets.
arXiv Detail & Related papers (2024-03-15T23:50:58Z)
- Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space [17.366915676628867]
We consider the reinforcement learning problem with general utilities.
Our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities.
arXiv Detail & Related papers (2023-06-02T18:16:35Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Primal-dual Learning for the Model-free Risk-constrained Linear Quadratic Regulator [0.8629912408966145]
Risk-aware control, though promising for handling unexpected events, requires a known exact dynamical model.
We propose a model-free framework to learn a risk-aware controller, with a focus on linear systems.
arXiv Detail & Related papers (2020-11-22T04:40:15Z)
- Model-based Reinforcement Learning for Continuous Control with Posterior Sampling [10.91557009257615]
We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces.
We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
arXiv Detail & Related papers (2020-11-20T21:00:31Z)
- Online DR-Submodular Maximization with Stochastic Cumulative Constraints [17.660958043781154]
We consider online continuous DR-submodular maximization with linear long-term constraints.
We propose the Online Lagrangian Frank-Wolfe (OLFW) algorithm to solve this class of online problems.
arXiv Detail & Related papers (2020-05-29T17:55:42Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)