Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
- URL: http://arxiv.org/abs/2302.03201v2
- Date: Wed, 24 May 2023 21:47:10 GMT
- Title: Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR
- Authors: Kaiwen Wang and Nathan Kallus and Wen Sun
- Abstract summary: We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in general attains a near-optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$.
- Score: 58.40575099910538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study risk-sensitive Reinforcement Learning (RL), focusing
on the objective of Conditional Value at Risk (CVaR) with risk tolerance
$\tau$. Starting with multi-arm bandits (MABs), we show the minimax CVaR regret
rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and
$K$ is the number of episodes, and that it is achieved by an Upper Confidence
Bound algorithm with a novel Bernstein bonus. For online RL in tabular Markov
Decision Processes (MDPs), we show a minimax regret lower bound of
$\Omega(\sqrt{\tau^{-1}SAK})$ (with normalized cumulative rewards), where $S$
is the number of states, and we propose a novel bonus-driven Value Iteration
procedure. We show that our algorithm achieves the optimal regret of
$\widetilde O(\sqrt{\tau^{-1}SAK})$ under a continuity assumption and in
general attains a near-optimal regret of $\widetilde O(\tau^{-1}\sqrt{SAK})$,
which is minimax-optimal for constant $\tau$. This improves on the best
available bounds. By discretizing rewards appropriately, our algorithms are
computationally efficient.
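To make the bandit-side result concrete, here is a minimal sketch of a CVaR-UCB loop: it pairs the standard lower-tail empirical CVaR estimator with an optimism bonus whose $1/\sqrt{\tau n}$ scale matches the stated regret rate. The simple Hoeffding-style bonus is a stand-in for the paper's sharper Bernstein bonus, and all function names (`empirical_cvar`, `cvar_ucb`) are illustrative, not from the paper.

```python
import numpy as np

def empirical_cvar(samples, tau):
    """Empirical CVaR at level tau: mean of the worst tau-fraction of samples.

    For rewards, CVaR_tau averages the lowest tau-quantile of the reward
    distribution, so maximizing it focuses on worst-case outcomes.
    """
    x = np.sort(np.asarray(samples))            # ascending: worst outcomes first
    m = max(1, int(np.ceil(tau * len(x))))      # size of the tau-tail
    return x[:m].mean()

def cvar_ucb(arms, tau, num_episodes):
    """Sketch of a CVaR-UCB loop over `arms` (callables returning rewards in [0, 1]).

    The bonus uses the 1/sqrt(tau * n) scale suggested by the regret rate;
    the paper's actual algorithm uses a sharper Bernstein-type bonus.
    """
    history = [[arm()] for arm in arms]         # pull each arm once to initialize
    for _ in range(len(arms), num_episodes):
        ucbs = [
            empirical_cvar(s, tau) + np.sqrt(2.0 * np.log(num_episodes) / (tau * len(s)))
            for s in history
        ]
        a = int(np.argmax(ucbs))                # optimism in the face of uncertainty
        history[a].append(arms[a]())
    return history

# Toy usage: two arms with similar means but different lower tails.
rng = np.random.default_rng(1)
arms = [
    lambda: float(rng.uniform(0.4, 0.6)),       # low variance: better tau-tail
    lambda: float(rng.choice([0.0, 1.0])),      # high variance: worse CVaR
]
counts = [len(h) for h in cvar_ucb(arms, tau=0.1, num_episodes=500)]
print(counts)                                   # arm 0 should dominate
```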
Related papers
- Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
We propose a model-based algorithm based on novel cost and reward function estimators.
Our algorithm achieves a regret upper bound of $\widetilde{\mathcal{O}}((\bar{C} - \bar{C}_b)^{-1}H^{2.5} S\sqrt{AK})$, where $\bar{C}$ is the cost budget for an episode.
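As a rough illustration of "cost pessimism and reward optimism", the sketch below builds bonus-adjusted tabular estimators: optimistic rewards and pessimistic (inflated) costs. The Hoeffding-style bonus and the $[0, 1]$ range assumption are simplifications, not the paper's tighter estimators.

```python
import numpy as np

def optimistic_reward_pessimistic_cost(reward_sum, cost_sum, counts, K):
    """Bonus-adjusted estimators for constrained (safe) RL, per (state, action).

    Rewards are over-estimated (optimism) and costs are also over-estimated
    (pessimism about safety), both by a Hoeffding-style bonus that shrinks
    as visit counts grow. Rewards and costs are assumed to lie in [0, 1].
    """
    n = np.maximum(counts, 1.0)
    bonus = np.sqrt(np.log(K) / (2.0 * n))
    r_optimistic = np.clip(reward_sum / n + bonus, 0.0, 1.0)
    c_pessimistic = np.clip(cost_sum / n + bonus, 0.0, 1.0)
    return r_optimistic, c_pessimistic

# Toy usage: 4 (state, action) pairs after a few episodes (K = 100 total).
r_opt, c_pes = optimistic_reward_pessimistic_cost(
    reward_sum=np.array([6.0, 1.0, 0.0, 20.0]),
    cost_sum=np.array([2.0, 2.0, 0.0, 5.0]),
    counts=np.array([10.0, 3.0, 0.0, 25.0]),
    K=100,
)
print(r_opt, c_pes)   # unvisited pairs get the largest bonus
```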
arXiv Detail & Related papers (2024-10-14T04:51:06Z) - Low-Rank Bandits via Tight Two-to-Infinity Singular Subspace Recovery [45.601316850669406]
We present efficient algorithms for policy evaluation, best policy identification and regret minimization.
For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal.
All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix, and then exploit these subspace estimates in the learning phase.
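A minimal sketch of that spectral first phase, under simplifying assumptions (uniformly sampled entries, entry-wise averaging, and a plain truncated SVD rather than the paper's scheme):

```python
import numpy as np

def estimate_singular_subspaces(rows, cols, rewards, S, K, r):
    """Phase-one sketch: estimate rank-r singular subspaces from noisy entries.

    (rows[i], cols[i], rewards[i]) are observed (context, action, reward)
    triples. Observations are averaged into an S x K empirical matrix, and a
    truncated SVD yields estimates of the left and right singular subspaces.
    """
    M = np.zeros((S, K))
    counts = np.zeros((S, K))
    for s, a, x in zip(rows, cols, rewards):
        M[s, a] += x
        counts[s, a] += 1.0
    M /= np.maximum(counts, 1.0)                 # empirical mean per entry
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :r], Vt[:r, :].T                 # estimated left/right subspaces

# Toy usage: rank-1 reward matrix plus noise, uniformly sampled entries.
rng = np.random.default_rng(0)
S, K, r, n = 30, 20, 1, 5000
true_M = np.outer(rng.random(S), rng.random(K))
rows, cols = rng.integers(0, S, n), rng.integers(0, K, n)
obs = true_M[rows, cols] + 0.1 * rng.standard_normal(n)
U_hat, V_hat = estimate_singular_subspaces(rows, cols, obs, S, K, r)
print(U_hat.shape, V_hat.shape)                  # (30, 1) (20, 1)
```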
arXiv Detail & Related papers (2024-02-24T06:36:08Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Context-lumpable stochastic bandits [49.024050919419366]
We consider a contextual bandit problem with $S$ contexts and $K$ actions, where the contexts can be lumped into $r$ groups.
We give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde{O}(r(S+K)/\epsilon^2)$ samples.
In the regret setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde{O}(\sqrt{r^3(S+K)T})$.
arXiv Detail & Related papers (2023-06-22T17:20:30Z) - Instance-optimal PAC Algorithms for Contextual Bandits [20.176752818200438]
In this work, we focus on the contextual bandit problem in the $(\epsilon,\delta)$-PAC setting.
We show that no algorithm can simultaneously be minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification.
arXiv Detail & Related papers (2022-07-05T23:19:43Z) - Impact of Representation Learning in Linear Bandits [83.17684841392754]
We study how representation learning can improve the efficiency of bandit problems.
We present a new algorithm which achieves $\widetilde{O}(T\sqrt{kN} + \sqrt{dkNT})$ regret, where $N$ is the number of rounds we play for each bandit.
arXiv Detail & Related papers (2020-10-13T16:35:30Z) - Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels.
We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ).
We prove that RSVI attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^3 S^2 A T}\big)$ regret, while RSQ attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^4 S A T}\big)$ regret.
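RSVI-style algorithms are built on the exponential-utility (entropic risk) Bellman backup, which replaces the expectation over next states with the certainty equivalent $\frac{1}{\beta}\log\mathbb{E}_{s'}[e^{\beta V(s')}]$. Below is a minimal sketch of that planning core on a known tabular MDP; it omits the exploration bonuses that drive the regret bounds, and the variable names are illustrative.

```python
import numpy as np

def entropic_backup(P, R, H, beta):
    """Exponential-utility value iteration on a known tabular MDP.

    P has shape (S, A, S) (transition kernel), R has shape (S, A), horizon H,
    risk parameter beta (beta < 0 is risk-averse). Each backup replaces the
    usual expectation with the certainty equivalent
    (1 / beta) * log E_{s'}[exp(beta * V(s'))].
    """
    V = np.zeros(P.shape[0])
    for _ in range(H):
        cert_eq = np.log(P @ np.exp(beta * V)) / beta   # shape (S, A)
        V = (R + cert_eq).max(axis=1)                   # greedy over actions
    return V

# Toy usage: 3 states, 2 actions, random MDP, risk-averse beta.
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)                       # normalize transitions
R = rng.random((3, 2))
print(entropic_backup(P, R, H=5, beta=-1.0))
```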
arXiv Detail & Related papers (2020-06-22T19:28:26Z) - Stochastic Bandits with Linear Constraints [69.757694218456]
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies.
We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB).
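A single round of the optimistic-pessimistic idea might look like the sketch below: reward estimates are inflated by a confidence width (optimism) and cost estimates are inflated by the same width (pessimism), then optimistic reward is maximized over the pessimistically feasible arms. OPLB itself optimizes over stochastic policies and guarantees feasibility; the per-arm filter and fallback here are simplifying assumptions.

```python
import numpy as np

def oplb_step(actions, A_inv, theta_r, theta_c, budget, alpha):
    """One round of an optimistic-pessimistic selection rule over arm features.

    `actions` is an (n, d) array. Reward estimates get an upper confidence
    bonus (optimism); cost estimates get the same bonus (pessimism, since a
    higher estimated cost is safer). We maximize optimistic reward over
    actions whose pessimistic cost stays within the budget.
    """
    widths = np.sqrt(np.einsum("nd,de,ne->n", actions, A_inv, actions))
    reward_ucb = actions @ theta_r + alpha * widths
    cost_ucb = actions @ theta_c + alpha * widths
    feasible = np.flatnonzero(cost_ucb <= budget)
    if feasible.size == 0:
        return int(np.argmin(cost_ucb))      # fallback: pick the safest-looking arm
    return int(feasible[np.argmax(reward_ucb[feasible])])

# Toy usage: d = 2 features, 4 candidate actions, known parameters.
rng = np.random.default_rng(0)
acts = rng.random((4, 2))
chosen = oplb_step(acts, np.eye(2), theta_r=np.array([1.0, 0.5]),
                   theta_c=np.array([0.2, 0.8]), budget=1.0, alpha=0.1)
print(chosen)
```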
arXiv Detail & Related papers (2020-06-17T22:32:19Z)