Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and
Constant Regret
- URL: http://arxiv.org/abs/2205.12418v1
- Date: Wed, 25 May 2022 00:03:25 GMT
- Title: Tiered Reinforcement Learning: Pessimism in the Face of Uncertainty and
Constant Regret
- Authors: Jiawei Huang, Li Zhao, Tao Qin, Wei Chen, Nan Jiang, Tie-Yan Liu
- Abstract summary: We propose a new learning framework that captures the tiered structure of many real-world user-interaction applications.
We simultaneously maintain two policies, $\pi^{\text{O}}$ and $\pi^{\text{E}}$.
We show that if Pessimistic Value Iteration is chosen as the exploitation algorithm to produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse users.
- Score: 144.06550139857296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new learning framework that captures the tiered structure of
many real-world user-interaction applications, where the users can be divided
into two groups based on their different tolerance for exploration risks and
should be treated separately. In this setting, we simultaneously maintain two
policies $\pi^{\text{O}}$ and $\pi^{\text{E}}$: $\pi^{\text{O}}$ ("O" for
"online") interacts with more risk-tolerant users from the first tier and
minimizes regret by balancing exploration and exploitation as usual, while
$\pi^{\text{E}}$ ("E" for "exploit") exclusively focuses on exploitation for
risk-averse users from the second tier utilizing the data collected so far. An
important question is whether such a separation yields advantages over the
standard online setting (i.e., $\pi^{\text{E}}=\pi^{\text{O}}$) for the
risk-averse users. We individually consider the gap-independent
vs.~gap-dependent settings. For the former, we prove that the separation is
indeed not beneficial from a minimax perspective. For the latter, we show that
if choosing Pessimistic Value Iteration as the exploitation algorithm to
produce $\pi^{\text{E}}$, we can achieve a constant regret for risk-averse
users independent of the number of episodes $K$, which is in sharp contrast to
the $\Omega(\log K)$ regret for any online RL algorithm in the same setting,
while the regret of $\pi^{\text{O}}$ (almost) maintains its online regret
optimality and does not need to compromise for the success of $\pi^{\text{E}}$.
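As a concrete (and purely illustrative) reading of this setup, the sketch below maintains both policies from shared data: one optimistic value-iteration pass yields $\pi^{\text{O}}$, and the same counts with the exploration bonus subtracted, in the spirit of Pessimistic Value Iteration, yield $\pi^{\text{E}}$. This is a minimal sketch, not the authors' implementation; the class name, the bonus form, and the `bonus_scale` constant are assumptions.

```python
import numpy as np

class TieredRL:
    """Hypothetical sketch of the tiered framework: pi_O explores
    optimistically, pi_E exploits pessimistically on the same data."""

    def __init__(self, S, A, H, bonus_scale=1.0):
        self.S, self.A, self.H = S, A, H
        self.c = bonus_scale                   # assumed bonus constant
        self.count = np.zeros((S, A))          # visit counts N(s, a)
        self.trans = np.zeros((S, A, S))       # transition counts N(s, a, s')
        self.rew = np.zeros((S, A))            # cumulative observed rewards

    def update(self, trajectory):
        """Record one episode collected by pi_O: a list of (s, a, r, s')."""
        for s, a, r, s2 in trajectory:
            self.count[s, a] += 1
            self.trans[s, a, s2] += 1
            self.rew[s, a] += r

    def _value_iteration(self, sign):
        """Finite-horizon VI with a Hoeffding-style bonus added
        (sign=+1, optimism for pi_O) or subtracted (sign=-1,
        pessimism for pi_E). Log factors are elided for brevity."""
        n = np.maximum(self.count, 1)
        p_hat = self.trans / n[:, :, None]     # empirical transitions
        r_hat = self.rew / n                   # empirical mean rewards
        bonus = self.c * self.H * np.sqrt(1.0 / n)
        V = np.zeros(self.S)                   # terminal values are zero
        policy = np.zeros((self.H, self.S), dtype=int)
        for h in reversed(range(self.H)):
            Q = r_hat + sign * bonus + p_hat @ V
            Q = np.clip(Q, 0.0, self.H - h)    # values lie in [0, H - h]
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy

    def policies(self):
        pi_O = self._value_iteration(sign=+1)  # balances explore/exploit
        pi_E = self._value_iteration(sign=-1)  # pessimistic value iteration
        return pi_O, pi_E
```

The design point mirrors the abstract: $\pi^{\text{E}}$ never pays for exploration, since it only consumes data that $\pi^{\text{O}}$ has already collected.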
Related papers
- Near-Minimax-Optimal Risk-Sensitive Reinforcement Learning with CVaR [58.40575099910538]
We study risk-sensitive Reinforcement Learning (RL), focusing on the objective of Conditional Value at Risk (CVaR) with risk tolerance $\tau$.
We show that the minimax CVaR regret rate is $\Omega(\sqrt{\tau^{-1}AK})$, where $A$ is the number of actions and $K$ is the number of episodes.
We show that our algorithm achieves the optimal regret of $\widetilde{O}(\tau^{-1}\sqrt{SAK})$ under a continuity assumption and in general attains a near ...
arXiv Detail & Related papers (2023-02-07T02:22:31Z)
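For concreteness, CVaR at tolerance $\tau$ averages the worst $\tau$-fraction of outcomes; a minimal empirical estimator of this objective (a hypothetical helper, not code from the paper above) might look like:

```python
import numpy as np

def empirical_cvar(returns, tau):
    """Empirical CVaR at risk tolerance tau: the average of the worst
    tau-fraction of observed returns (the lower tail)."""
    x = np.sort(np.asarray(returns, dtype=float))  # ascending: worst first
    k = max(1, int(np.ceil(tau * len(x))))         # size of the tail
    return float(x[:k].mean())

# Example: empirical_cvar([1.0, 2.0, 0.1, 3.0], tau=0.25) == 0.1
```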
- Private Online Prediction from Experts: Separations and Faster Rates [74.52487417350221]
Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints.
We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries.
arXiv Detail & Related papers (2022-10-24T18:40:19Z)
- Coordinated Attacks against Contextual Bandits: Fundamental Limits and Defense Mechanisms [75.17357040707347]
Motivated by online recommendation systems, we propose the problem of finding the optimal policy in contextual bandits.
The goal is to robustly learn the policy that maximizes rewards for good users with as few user interactions as possible.
We show we can achieve an $\widetilde{O}(\min(S,A)\cdot\alpha/\epsilon^2)$ upper bound by employing efficient robust mean estimators.
arXiv Detail & Related papers (2022-01-30T01:45:13Z)
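The "efficient robust mean estimators" mentioned in the entry above are not specified in the summary; one standard example is median-of-means, sketched here as a generic illustration (not necessarily the estimator the paper uses):

```python
import numpy as np

def median_of_means(samples, num_blocks=8):
    """Median-of-means estimator: partition the samples into blocks,
    average within each block, and return the median of the block means.
    Robust to a small fraction of outliers or corrupted samples."""
    x = np.asarray(samples, dtype=float)
    k = min(num_blocks, len(x))                       # avoid empty blocks
    blocks = np.array_split(np.random.permutation(x), k)
    return float(np.median([b.mean() for b in blocks]))
```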
- Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\big(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\big)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho < \Theta(1/(HS))$ and $H > 1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z)
- Near-Optimal Algorithms for Differentially Private Online Learning in a Stochastic Environment [7.4288915613206505]
We study differentially private online learning problems in a stochastic environment under both bandit and full-information feedback.
For differentially private bandits, we propose both UCB- and Thompson-Sampling-based algorithms that are anytime and match the $O\left(\sum_{j: \Delta_j>0} \frac{\ln(T)}{\min\left\{\Delta_j, \epsilon\right\}}\right)$ minimax lower bound.
For the same differentially private full-information setting, we also present an $\epsilon$-differentially ...
arXiv Detail & Related papers (2021-02-16T02:48:16Z)
- Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap [84.66885506098724]
This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDPs): Adaptive Multi-step Bootstrap (AMB).
We show AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps.
We also show AMB suffers an additional $\frac{|Z_{\mathrm{mul}}|}{\Delta_{\mathrm{min}}}$ regret, where $Z_{\mathrm{mul}}$ is the set of state-action pairs $(s,a)$ such that $a$ is a non-unique optimal action for ...
arXiv Detail & Related papers (2021-02-09T07:46:34Z)
- Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition [37.6975819766632]
We study the stochastic shortest path problem with adversarial costs and known transition.
We show that the minimax regret is $\widetilde{O}(\sqrt{D T_\star K})$ for the full-information setting and $\widetilde{O}(\sqrt{D T_\star S A K})$ for the bandit-feedback setting.
arXiv Detail & Related papers (2020-12-07T20:55:28Z)
- Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning [43.75491612671571]
We consider multi-objective reinforcement learning where the objectives are balanced using preferences.
We formalize this problem as an episodic learning problem on a Markov decision process.
We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}}\cdot H^3SA/\epsilon^2\bigr)$.
arXiv Detail & Related papers (2020-11-25T21:45:04Z)
- Risk-Sensitive Reinforcement Learning: Near-Optimal Risk-Sample Tradeoff in Regret [115.85354306623368]
We study risk-sensitive reinforcement learning in episodic Markov decision processes with unknown transition kernels.
We propose two provably efficient model-free algorithms: Risk-Sensitive Value Iteration (RSVI) and Risk-Sensitive Q-learning (RSQ).
We prove that RSVI attains an $\widetilde{O}\big(\lambda(|\beta| H^2) \cdot \sqrt{H^3 S^2 A T}\big)$ regret, while RSQ attains an $\widetilde{O}\big(\lambda \ldots$
arXiv Detail & Related papers (2020-06-22T19:28:26Z)
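The $|\beta|$ in the bounds above is the parameter of the entropic (exponential-utility) risk objective studied in this line of work; a small, numerically stable evaluator of that objective (an illustrative helper, not code from the paper) is:

```python
import numpy as np
from scipy.special import logsumexp

def entropic_risk(returns, beta):
    """Entropic risk of a return distribution from samples:
    (1 / beta) * log E[exp(beta * R)], computed stably via logsumexp.
    beta > 0 is risk-seeking, beta < 0 is risk-averse, and the
    measure tends to the plain mean as beta -> 0 (beta must be nonzero)."""
    r = np.asarray(returns, dtype=float)
    return (logsumexp(beta * r) - np.log(len(r))) / beta
```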
This list is automatically generated from the titles and abstracts of the papers on this site.