Provably Efficient Model-Free Constrained RL with Linear Function
Approximation
- URL: http://arxiv.org/abs/2206.11889v1
- Date: Thu, 23 Jun 2022 17:54:31 GMT
- Title: Provably Efficient Model-Free Constrained RL with Linear Function
Approximation
- Authors: Arnob Ghosh, Xingyu Zhou, Ness Shroff
- Abstract summary: We develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems.
Our results are achieved via novel adaptations of the standard LSVI-UCB algorithms.
- Score: 4.060731229044571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the constrained reinforcement learning problem, in which an agent
aims to maximize the expected cumulative reward subject to a constraint on the
expected total value of a utility function. In contrast to existing model-based
approaches or model-free methods accompanied with a `simulator', we aim to
develop the first model-free, simulator-free algorithm that achieves a
sublinear regret and a sublinear constraint violation even in large-scale
systems. To this end, we consider the episodic constrained Markov decision
processes with linear function approximation, where the transition dynamics and
the reward function can be represented as a linear function of some known
feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and
$\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be
achieved, where $d$ is the dimension of the feature mapping, $H$ is the length
of the episode, and $T$ is the total number of steps. Our bounds are attained
without explicitly estimating the unknown transition model or requiring a
simulator, and they depend on the state space only through the dimension of the
feature mapping. Hence our bounds hold even when the number of states goes to
infinity. Our main results are achieved via novel adaptations of the standard
LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization
into the LSVI-UCB algorithm to balance between regret and constraint violation.
More importantly, we replace the standard greedy selection with respect to the
state-action function in LSVI-UCB with a soft-max policy. This turns out to be
key in establishing uniform concentration for the constrained case via its
approximation-smoothness trade-off. We also show that one can achieve zero
constraint violation while still maintaining the same order with respect to $T$.
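The abstract compresses the setup into prose; the LaTeX snippet below writes out, under standard episodic CMDP conventions, the objective, the linear MDP assumption, and the two guarantees quoted above. The symbols $r$, $g$, $b$, $x_1$, $\pi^k$, $\pi^\star$, $\mu_h$, $\theta_h$ are introduced here only for illustration (they do not appear verbatim in the abstract), and the violation is written as a plain sum; some papers instead use a positive-part convention.

```latex
% Episodic CMDP: reward r, utility g, horizon H, constraint threshold b.
\max_{\pi}\; V^{\pi}_{r,1}(x_1)
\quad \text{s.t.} \quad V^{\pi}_{g,1}(x_1) \ge b,
\qquad
V^{\pi}_{\diamond,1}(x_1) := \mathbb{E}_{\pi}\Big[\textstyle\sum_{h=1}^{H} \diamond_h(x_h,a_h)\Big],
\quad \diamond \in \{r,g\}.

% Linear MDP assumption: for a known feature map \phi(x,a) \in \mathbb{R}^d,
% transitions and rewards are linear in \phi:
\mathbb{P}_h(\cdot \mid x,a) = \langle \phi(x,a), \mu_h(\cdot) \rangle,
\qquad
r_h(x,a) = \langle \phi(x,a), \theta_h \rangle.

% With K = T/H episodes, executed policies \pi^1,\dots,\pi^K, and \pi^\star the
% best policy satisfying the constraint, the stated guarantees read
\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^{\pi^\star}_{r,1}(x_1) - V^{\pi^k}_{r,1}(x_1) \Big)
  = \tilde{\mathcal{O}}\big(\sqrt{d^3H^3T}\big),
\qquad
\mathrm{Violation}(K) = \sum_{k=1}^{K} \Big( b - V^{\pi^k}_{g,1}(x_1) \Big)
  = \tilde{\mathcal{O}}\big(\sqrt{d^3H^3T}\big).
```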
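As a rough illustration of the two algorithmic ingredients highlighted in the abstract, the sketch below wraps an LSVI-UCB-style least-squares update in a primal-dual loop and acts with a soft-max rather than greedy policy over the optimistic Q-estimates. It is a minimal sketch under simplifying assumptions, not the paper's algorithm: the environment interface, the feature map phi, the step sizes, and the frozen regression targets are all hypothetical choices made here for brevity.

```python
import numpy as np


def softmax_policy(q_row, temperature):
    """Soft-max distribution over one state's optimistic Q-values."""
    z = (q_row - q_row.max()) / temperature
    p = np.exp(z)
    return p / p.sum()


def primal_dual_lsvi_ucb(env, phi, d, H, K, b, beta=1.0, temperature=0.1, eta=0.05):
    """Hypothetical interface: env.reset() -> state, env.step(a) -> (state, reward, utility),
    env.n_actions; phi(state, action) -> np.ndarray of shape (d,) (linear MDP features);
    b is the required expected total utility (constraint threshold)."""
    n_actions = env.n_actions
    dual = 0.0                                  # Lagrange multiplier for the constraint
    gram = [np.eye(d) for _ in range(H)]        # per-step regularized Gram matrices
    data = [[] for _ in range(H)]               # (feature, regression target) pairs

    for _ in range(K):
        # Per-step ridge regression on the dual-mixed signal r + dual * g.
        w = [np.zeros(d) for _ in range(H)]
        for h in range(H):
            if data[h]:
                X = np.array([f for f, _ in data[h]])
                y = np.array([t for _, t in data[h]])
                w[h] = np.linalg.solve(gram[h], X.T @ y)

        def q_value(h, s, a):
            f = phi(s, a)
            bonus = beta * np.sqrt(f @ np.linalg.solve(gram[h], f))  # UCB exploration bonus
            return float(f @ w[h] + bonus)

        # Forward pass: act with a soft-max policy over optimistic Q-values (not greedy).
        s = env.reset()
        ep_utility = 0.0
        for h in range(H):
            q_row = np.array([q_value(h, s, a) for a in range(n_actions)])
            a = int(np.random.choice(n_actions, p=softmax_policy(q_row, temperature)))
            s_next, r, g = env.step(a)
            ep_utility += g

            if h < H - 1:
                q_next = np.array([q_value(h + 1, s_next, ap) for ap in range(n_actions)])
                p_next = softmax_policy(q_next, temperature)
                v_next = float(p_next @ q_next)   # value of the soft-max policy at step h+1
            else:
                v_next = 0.0

            f = phi(s, a)
            gram[h] += np.outer(f, f)
            # Simplification: full LSVI-UCB recomputes regression targets every episode
            # from stored transitions; here they are frozen at collection time.
            data[h].append((f, r + dual * g + v_next))
            s = s_next

        # Dual update: raise the multiplier whenever the episode under-delivers utility.
        dual = max(0.0, dual + eta * (b - ep_utility))

    return w, dual
```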
Related papers
- Settling Constant Regrets in Linear Markov Decision Processes [57.34287648914407]
We study constant regret guarantees in reinforcement learning (RL).
We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs).
For an MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}\dots$
arXiv Detail & Related papers (2024-04-16T17:23:19Z)
- Horizon-Free Regret for Linear Markov Decision Processes [92.02082223856479]
A recent line of works showed regret bounds in reinforcement learning can be (nearly) independent of planning horizon.
We give the first horizon-free bound for the popular linear Markov Decision Process (MDP) setting.
In contrast to prior works which explicitly estimate the transition model and compute the inhomogeneous value functions, we directly estimate the value functions and confidence sets.
arXiv Detail & Related papers (2024-03-15T23:50:58Z)
- Reinforcement Learning with General Utilities: Simpler Variance Reduction and Large State-Action Space [17.366915676628867]
We consider the reinforcement learning problem with general utilities.
Our algorithm achieves $\tilde{\mathcal{O}}(\epsilon^{-3})$ and $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexities.
arXiv Detail & Related papers (2023-06-02T18:16:35Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach [57.570201404222935]
We show that it is possible to obtain regret scaling as $\mathcal{O}(\sqrt{V_1^\star K})$ in reinforcement learning with large state spaces.
We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result.
arXiv Detail & Related papers (2021-12-07T00:29:57Z)
- Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
- Primal-dual Learning for the Model-free Risk-constrained Linear Quadratic Regulator [0.8629912408966145]
Risk-aware control, though promising for handling unexpected events, requires a known exact dynamical model.
We propose a model-free framework to learn a risk-aware controller, with a focus on linear systems.
arXiv Detail & Related papers (2020-11-22T04:40:15Z)
- Model-based Reinforcement Learning for Continuous Control with Posterior Sampling [10.91557009257615]
We study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces.
We present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection.
arXiv Detail & Related papers (2020-11-20T21:00:31Z)
- Online DR-Submodular Maximization with Stochastic Cumulative Constraints [17.660958043781154]
We consider online continuous DR-submodular maximization with linear long-term constraints.
We propose the Online Lagrangian Frank-Wolfe (OLFW) algorithm to solve this class of online problems.
arXiv Detail & Related papers (2020-05-29T17:55:42Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)