Provably Efficient CVaR RL in Low-rank MDPs
- URL: http://arxiv.org/abs/2311.11965v1
- Date: Mon, 20 Nov 2023 17:44:40 GMT
- Title: Provably Efficient CVaR RL in Low-rank MDPs
- Authors: Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen
Sun, Jason D. Lee
- Abstract summary: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$.
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
- Score: 58.58570425202862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize
the Conditional Value at Risk (CVaR) with a fixed risk tolerance $\tau$. Prior
theoretical work studying risk-sensitive RL focuses on the tabular Markov
Decision Processes (MDPs) setting. To extend CVaR RL to settings where state
space is large, function approximation must be deployed. We study CVaR RL in
low-rank MDPs with nonlinear function approximation. Low-rank MDPs assume the
underlying transition kernel admits a low-rank decomposition, but unlike prior
linear models, low-rank MDPs do not assume the feature or state-action
representation is known. We propose a novel Upper Confidence Bound (UCB)
bonus-driven algorithm to carefully balance the interplay between exploration,
exploitation, and representation learning in CVaR RL. We prove that our
algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2
d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$
is the length of each episode, $A$ is the capacity of the action space, and $d$ is
the dimension of the representations. Computation-wise, we design a novel
discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR
objective as the planning oracle and show that we can find the near-optimal
policy in a polynomial running time with a Maximum Likelihood Estimation
oracle. To our knowledge, this is the first provably efficient CVaR RL
algorithm in low-rank MDPs.
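For concreteness, the CVaR objective at risk tolerance $\tau$ is the expected return over the worst $\tau$-fraction of outcomes. Below is a minimal plug-in sketch of that quantity in NumPy; it only illustrates the objective being maximized, not the paper's UCB bonus-driven algorithm.

```python
import numpy as np

def empirical_cvar(returns, tau):
    """Plug-in estimate of CVaR at risk tolerance tau: the mean of the worst
    tau-fraction of sampled episode returns. Illustration of the objective
    being maximized, not the paper's algorithm."""
    returns = np.sort(np.asarray(returns, dtype=float))   # worst returns first
    k = max(1, int(np.ceil(tau * len(returns))))          # size of the worst tau-fraction
    return returns[:k].mean()

# Example: tau = 0.1 evaluates a policy by its worst 10% of outcomes.
rng = np.random.default_rng(0)
sampled_returns = rng.normal(loc=5.0, scale=2.0, size=1000)
print(empirical_cvar(sampled_returns, tau=0.1))
```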
Related papers
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning
with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large and near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
We instead directly sample the Q function from its posterior distribution using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
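A toy sketch of the sampling step described above, assuming a linear Q-function and plain NumPy (the paper's deep-RL implementation is far richer): noisy gradient descent on the ridge-regularized least-squares loss, i.e. unadjusted Langevin dynamics, approximately draws Q-parameters from their posterior.

```python
import numpy as np

def langevin_q_sample(phi, targets, theta, step=1e-3, n_steps=200, prior_prec=1.0, rng=None):
    """Approximately sample linear Q-function weights from their posterior by
    running unadjusted Langevin dynamics on the ridge-regularized TD loss.
    phi: (n, d) state-action features, targets: (n,) regression targets,
    theta: (d,) initial weights. Toy illustration only."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_steps):
        grad = phi.T @ (phi @ theta - targets) + prior_prec * theta   # gradient of the negative log-posterior
        noise = np.sqrt(2.0 * step) * rng.standard_normal(theta.shape)
        theta = theta - step * grad + noise   # injected noise turns descent into posterior sampling
    return theta
```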
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice [79.48432795639403]
Mirror descent value iteration (MDVI) is an abstraction of Kullback-Leibler (KL) and entropy-regularized reinforcement learning (RL).
We study MDVI with linear function approximation through the sample complexity required to identify an $\varepsilon$-optimal policy.
We present Variance-Weighted Least-Squares MDVI, the first theoretical algorithm that achieves nearly minimax optimal sample complexity for infinite-horizon linear MDPs.
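A minimal sketch of the generic variance-weighted regression idea (inverse-variance-weighted ridge regression); the paper's estimator and its variance construction are more involved, so treat this as an illustration only.

```python
import numpy as np

def variance_weighted_ridge(phi, y, var, lam=1.0):
    """Weighted least squares that down-weights high-variance regression targets.
    phi: (n, d) features, y: (n,) targets, var: (n,) per-sample variance
    estimates. Generic construction, not the paper's algorithm."""
    w = 1.0 / np.maximum(var, 1e-8)                       # inverse-variance weights
    A = (phi * w[:, None]).T @ phi + lam * np.eye(phi.shape[1])
    b = (phi * w[:, None]).T @ y
    return np.linalg.solve(A, b)
```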
arXiv Detail & Related papers (2023-05-22T16:13:05Z)
- Provably Efficient Neural Offline Reinforcement Learning via Perturbed Rewards [33.88533898709351]
VIPeR amalgamates the randomized value function idea with the pessimism principle.
It implicitly obtains pessimism by simply perturbing the offline data multiple times.
It is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
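A rough linear-regression stand-in for the perturb-and-fit idea (the actual VIPeR algorithm uses neural network function approximation; the names and interface here are hypothetical):

```python
import numpy as np

def perturbed_value_ensemble(phi, rewards, n_perturb=10, noise_scale=0.1, lam=1.0, rng=None):
    """Fit several regression estimates on independently noise-perturbed copies
    of the offline rewards. Acting with the minimum over the ensemble behaves
    like an implicit pessimism penalty, with no explicit bonus term."""
    rng = rng or np.random.default_rng(0)
    A = phi.T @ phi + lam * np.eye(phi.shape[1])
    thetas = [
        np.linalg.solve(A, phi.T @ (rewards + noise_scale * rng.standard_normal(rewards.shape)))
        for _ in range(n_perturb)
    ]
    return np.stack(thetas)   # pessimistic values at new features: np.min(phi_new @ np.stack(thetas).T, axis=1)
```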
arXiv Detail & Related papers (2023-02-24T17:52:12Z)
- Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}\left(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK}\right)$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, the number of actions, the number of episodes, and the horizon length, respectively.
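For reference, the entropic risk measure named above is $\mathrm{EntRM}_\beta(Z) = \frac{1}{\beta}\log \mathbb{E}[\exp(\beta Z)]$; a small plug-in sketch follows (the paper's optimistic algorithms are not reproduced here).

```python
import numpy as np

def entropic_risk(returns, beta):
    """Plug-in entropic risk of sampled returns: (1/beta) * log E[exp(beta * Z)].
    beta < 0 is risk-averse, beta > 0 is risk-seeking, and beta -> 0 recovers
    the mean. Computed in a log-sum-exp fashion for numerical stability."""
    z = beta * np.asarray(returns, dtype=float)
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / beta
```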
arXiv Detail & Related papers (2022-10-25T14:30:48Z)
- Provably Efficient Risk-Sensitive Reinforcement Learning: Iterated CVaR and Worst Path [40.4378338001229]
We study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step.
This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process.
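In contrast to the static CVaR-of-total-return objective in the main abstract, Iterated CVaR applies the CVaR operator at every step of the backup. A small tabular sketch under a hypothetical interface (not the paper's algorithm):

```python
import numpy as np

def discrete_cvar(values, probs, alpha):
    """CVaR at level alpha of a discrete distribution: mean of the worst
    alpha-probability mass of `values` weighted by `probs`."""
    order = np.argsort(values)                 # worst outcomes first
    mass, acc = 0.0, 0.0
    for i in order:
        take = min(probs[i], alpha - mass)
        if take <= 0.0:
            break
        acc += take * values[i]
        mass += take
    return acc / alpha

def iterated_cvar_backup(rewards, next_values, transitions, alpha):
    """One Iterated CVaR Bellman backup:
    Q(s, a) = r(s, a) + CVaR_alpha of V(s') under s' ~ P(. | s, a)."""
    n_states, n_actions = rewards.shape
    q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = rewards[s, a] + discrete_cvar(next_values, transitions[s, a], alpha)
    return q
```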
arXiv Detail & Related papers (2022-06-06T15:24:06Z)
- Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
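One common way to link trajectory preferences to an underlying reward is a Bradley-Terry-style model on trajectory returns; this is shown only as an illustrative assumption, since the summarized paper handles general function approximation rather than this specific link.

```python
import numpy as np

def preference_prob(return_a, return_b, temperature=1.0):
    """Probability the overseer prefers trajectory A over B under a
    Bradley-Terry-style link on (latent) trajectory returns. Illustrative
    assumption only, not necessarily the paper's preference model."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(return_a) - np.asarray(return_b)) / temperature))

def preference_log_likelihood(returns_a, returns_b, labels, temperature=1.0):
    """Log-likelihood of binary preference labels (1 means A was preferred);
    maximizing this over a reward model recovers rewards consistent with the feedback."""
    p = preference_prob(returns_a, returns_b, temperature)
    labels = np.asarray(labels, dtype=float)
    return float(np.sum(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))
```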
arXiv Detail & Related papers (2022-05-23T09:03:24Z)
- Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z)
- Representation Learning for Online and Offline RL in Low-rank MDPs [36.398511188102205]
We focus on the low-rank Markov Decision Processes (MDPs) where the transition dynamics correspond to a low-rank transition matrix.
For the online setting, operating with the same computational oracles used in FLAMBE, we propose REP-UCB (Upper Confidence Bound driven Representation learning for RL).
For the offline RL setting, we develop an algorithm that leverages pessimism to learn under a partial coverage condition.
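Exploration bonuses in this line of work, including the UCB bonus in the main abstract, are typically elliptical-potential terms built from the (learned) representation. A generic sketch with the scaling constant left as a free parameter; the exact bonuses used by REP-UCB or the CVaR algorithm above are not reproduced here.

```python
import numpy as np

def elliptical_bonus(phi_query, phi_data, beta=1.0, lam=1.0):
    """Generic UCB-style bonus beta * sqrt(phi^T Sigma^{-1} phi), where Sigma is
    the regularized empirical covariance of previously collected (learned)
    features. Standard construction shown for illustration only.
    phi_query: (m, d) query features, phi_data: (n, d) features seen so far."""
    d = phi_data.shape[1]
    sigma = phi_data.T @ phi_data + lam * np.eye(d)
    sigma_inv = np.linalg.inv(sigma)
    quad = np.einsum("md,dk,mk->m", phi_query, sigma_inv, phi_query)
    return beta * np.sqrt(quad)
```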
arXiv Detail & Related papers (2021-10-09T22:04:34Z)