Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds
- URL: http://arxiv.org/abs/2210.14051v3
- Date: Thu, 25 Jan 2024 13:23:05 GMT
- Title: Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds
- Authors: Hao Liang, Zhi-Quan Luo
- Abstract summary: We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the time horizon, respectively.
- Score: 24.571530193140916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the regret guarantee for risk-sensitive reinforcement learning
(RSRL) via distributional reinforcement learning (DRL) methods. In particular,
we consider finite episodic Markov decision processes whose objective is the
entropic risk measure (EntRM) of return. By leveraging a key property of the
EntRM, the independence property, we establish the risk-sensitive
distributional dynamic programming framework. We then propose two novel DRL
algorithms that implement optimism through two different schemes, including a
model-free one and a model-based one.
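As a quick illustration (a minimal numerical sketch, not taken from the paper; the distributions and the helper `entrm` below are illustrative), the entropic risk measure is $\mathrm{EntRM}_\beta(X) = \frac{1}{\beta}\log\mathbb{E}[e^{\beta X}]$, and the independence property says it is additive over sums of independent random variables, which is the structure a distributional dynamic-programming decomposition can exploit:

```python
import numpy as np

def entrm(values, probs, beta):
    """EntRM_beta(X) = (1/beta) * log E[exp(beta * X)] for a discrete X."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    if beta == 0.0:                      # risk-neutral limit: plain expectation
        return float(values @ probs)
    z = beta * values + np.log(probs)    # log-sum-exp for numerical stability
    m = z.max()
    return float((m + np.log(np.exp(z - m).sum())) / beta)

# Two independent (illustrative) reward distributions.
x_vals, x_probs = [0.0, 1.0], [0.5, 0.5]
y_vals, y_probs = [0.0, 2.0], [0.7, 0.3]

# Distribution of the independent sum X + Y.
s_vals  = [xv + yv for xv in x_vals for yv in y_vals]
s_probs = [xp * yp for xp in x_probs for yp in y_probs]

beta = -0.5  # beta < 0 is risk-averse, beta > 0 risk-seeking
print(entrm(s_vals, s_probs, beta))                                 # EntRM(X + Y)
print(entrm(x_vals, x_probs, beta) + entrm(y_vals, y_probs, beta))  # EntRM(X) + EntRM(Y): equal
```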
We prove that they both attain $\tilde{\mathcal{O}}(\frac{\exp(|\beta|
H)-1}{|\beta|}H\sqrt{S^2AK})$ regret upper bound, where $S$, $A$, $K$, and $H$
represent the number of states, actions, episodes, and the time horizon,
respectively. It matches the regret bound of RSVI2 proposed in \cite{fei2021exponential}, but is obtained through a novel distributional analysis. To the best of our knowledge, this is the first
regret analysis that bridges DRL and RSRL in terms of sample complexity.
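For context on what planning under the EntRM objective looks like with a known model, here is a hedged sketch of the standard exponential-utility Bellman recursion (a textbook recursion with a toy MDP and illustrative names, not the paper's distributional algorithm; roughly speaking, RSVI-style methods add optimism bonuses on top of backups of this form):

```python
import numpy as np

def entrm_value_iteration(P, R, H, beta):
    """Finite-horizon value iteration for the EntRM objective with a known model.
    P: transitions, shape (S, A, S'); R: deterministic rewards, shape (S, A)."""
    S, A, _ = P.shape
    V = np.zeros((H + 1, S))             # V[H] = 0: no reward after the horizon
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # Q_h(s,a) = R(s,a) + (1/beta) * log E_{s'~P(.|s,a)}[exp(beta * V_{h+1}(s'))]
        cert = np.log(P @ np.exp(beta * V[h + 1])) / beta   # certainty equivalent, shape (S, A)
        Q = R + cert
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi

# Toy two-state, two-action MDP (purely illustrative).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))           # each row over s' sums to one
R = np.array([[0.0, 1.0], [0.5, 0.0]])
V, pi = entrm_value_iteration(P, R, H=5, beta=-1.0)  # beta < 0: risk-averse planning
print(V[0], pi[0])
```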
Acknowledging the computational inefficiency of the model-free DRL algorithm, we propose an alternative DRL algorithm with a distribution representation. This approach not only maintains the established regret bounds but is also significantly more computationally efficient.
We also prove a tighter minimax lower bound of $\Omega(\frac{\exp(\beta
H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight
lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
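As a small sanity check (a numerical sketch based only on the bounds stated above, not part of the paper; `H` and the helper names are illustrative), the exponential prefactors in both bounds degenerate gracefully as $\beta \to 0$: the upper-bound factor $\frac{e^{|\beta|H}-1}{|\beta|}$ tends to $H$ and the lower-bound factor $\frac{e^{\beta H/6}-1}{\beta H}$ tends to $1/6$, recovering risk-neutral-style scalings, while both grow exponentially in $|\beta|H$ for strong risk sensitivity:

```python
import numpy as np

H = 10  # horizon (illustrative value)

def upper_factor(beta, H):
    """(exp(|beta| * H) - 1) / |beta|, the prefactor of the regret upper bound."""
    b = abs(beta)
    return float(H) if b == 0 else (np.exp(b * H) - 1.0) / b

def lower_factor(beta, H):
    """(exp(beta * H / 6) - 1) / (beta * H), the prefactor of the minimax lower bound (beta > 0)."""
    return 1.0 / 6.0 if beta == 0 else (np.exp(beta * H / 6.0) - 1.0) / (beta * H)

for beta in (1.0, 0.1, 0.01, 1e-6):
    print(f"beta={beta:g}  upper={upper_factor(beta, H):.4g}  lower={lower_factor(beta, H):.4g}")
# As beta -> 0 the upper factor approaches H and the lower factor approaches 1/6;
# for larger beta both blow up exponentially in beta * H.
```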
Related papers
- Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL).
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay among exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity bound for learning an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that this separation does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff
in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels.
We propose two provably efficient model-free algorithms, Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ).
We prove that RSVI attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^3 S^2 A T}\big)$ regret, while RSQ attains an $\tilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^4 S A T}\big)$ regret.
arXiv Detail & Related papers (2020-06-22T19:28:26Z)