Near-Optimal Adversarial Reinforcement Learning with Switching Costs
- URL: http://arxiv.org/abs/2302.04374v1
- Date: Wed, 8 Feb 2023 23:41:29 GMT
- Title: Near-Optimal Adversarial Reinforcement Learning with Switching Costs
- Authors: Ming Shi, Yingbin Liang, Ness Shroff
- Abstract summary: We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best previously achieved $\tilde{O}(\sqrt{T})$ regret (from static RL with switching costs and from adversarial RL without switching costs) is no longer achievable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
- Score: 43.895798638743784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Switching costs, which capture the costs for changing policies, are regarded
as a critical metric in reinforcement learning (RL), in addition to the
standard metric of losses (or rewards). However, existing studies on switching
costs (with a coefficient $\beta$ that is strictly positive and is independent
of $T$) have mainly focused on static RL, where the loss distribution is
assumed to be fixed during the learning process, and thus practical scenarios
where the loss distribution could be non-stationary or even adversarial are not
considered. While adversarial RL better models this type of practical
scenarios, an open problem remains: how to develop a provably efficient
algorithm for adversarial RL with switching costs? This paper makes the first
effort towards solving this problem. First, we provide a regret lower-bound
that shows that the regret of any algorithm must be larger than
$\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are
the number of episodes, states, actions and layers in each episode,
respectively. Our lower bound indicates that, due to the fundamental challenge
of switching costs in adversarial RL, the best achieved regret (whose
dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs
(as well as adversarial RL without switching costs) is no longer achievable.
Moreover, we propose two novel switching-reduced algorithms with regrets that
match our lower bound when the transition function is known, and match our
lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition
function is unknown. Our regret analysis demonstrates the near-optimal
performance of both algorithms.
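The gap highlighted by the lower bound can be made concrete numerically. The following is a minimal sketch, not the paper's analysis: it assumes a generic $\sqrt{HSAT}$-type rate as a stand-in for the $\tilde{O}(\sqrt{T})$ regret achievable without switching costs (the exact polynomial dependence on $H$, $S$, $A$ varies across algorithms) and ignores constants and polylogarithmic factors.

```python
# Illustrative comparison of the two regret scalings discussed in the abstract.
# Constants and polylog factors are ignored; the sqrt(H*S*A*T) form is an assumed
# stand-in for a sqrt(T)-type bound, not a result taken from the paper.

def sqrt_t_rate(H: int, S: int, A: int, T: int) -> float:
    """~ sqrt(H*S*A*T): a T^{1/2}-type rate, achievable without switching costs."""
    return (H * S * A * T) ** 0.5

def switching_lower_bound(H: int, S: int, A: int, T: int) -> float:
    """~ (H*S*A)^{1/3} * T^{2/3}: the lower-bound scaling once switching costs are present."""
    return (H * S * A) ** (1 / 3) * T ** (2 / 3)

if __name__ == "__main__":
    H, S, A = 10, 20, 5  # hypothetical problem sizes
    for T in (10**3, 10**5, 10**7):
        print(f"T={T:>8}:  sqrt-rate ~ {sqrt_t_rate(H, S, A, T):>12,.0f}"
              f"   (HSA)^(1/3) T^(2/3) lower bound ~ {switching_lower_bound(H, S, A, T):>12,.0f}")
```

As $T$ grows, the $T^{2/3}$ term dominates any $\sqrt{T}$-type rate, which is the sense in which switching costs make the previously best regret unattainable.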
Related papers
- The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure [9.631640936820126]
Many reinforcement learning algorithms are too costly to use in practice due to the large sizes $S, A$ of the problem's state and action spaces.
We consider the problem of transferring a latent low-rank representation when the source and target MDPs have transition kernels with latent low-rank structure.
Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $S, A $, or $S A$ in the target MDP regret bound.
arXiv Detail & Related papers (2024-10-28T23:12:08Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta| H}\sqrt{S^2 A K})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the horizon length, respectively.
arXiv Detail & Related papers (2022-10-25T14:30:48Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL)
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4 S^2 A T})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$; a simplified sketch of the low-switching idea appears after this list.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Revisiting Smoothed Online Learning [70.09792747315323]
We investigate the problem of smoothed online learning, in which the online learner suffers both a hitting cost and a switching cost.
To bound the competitive ratio, we assume the hitting cost is known to the learner in each round, and investigate the greedy algorithm which simply minimizes the weighted sum of the hitting cost and the switching cost.
arXiv Detail & Related papers (2021-02-13T14:15:55Z) - Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control [28.80743320843154]
We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
arXiv Detail & Related papers (2020-10-07T04:55:56Z)
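Several of the entries above revolve around keeping the number of policy switches small while running many episodes. Below is a minimal sketch, under the simplifying assumption of a plain doubling-epoch schedule (which yields $O(\log T)$ switches; the $O(\log\log T)$ and reward-free schedules in the papers above are more refined), with a placeholder standing in for the actual policy update:

```python
import math

def run_with_doubling_epochs(T: int) -> int:
    """Run T episodes but recompute the policy only when the episode index
    reaches a power of two, giving O(log T) policy switches instead of O(T).
    The policy-update step is a hypothetical placeholder."""
    policy = None
    switches = 0
    next_update = 1
    for episode in range(1, T + 1):
        if episode == next_update:
            policy = f"policy_updated_at_episode_{episode}"  # placeholder update
            switches += 1
            next_update *= 2  # doubling schedule: next recomputation twice as far out
        # ... play the episode under `policy` and record the observed losses ...
    return switches

if __name__ == "__main__":
    for T in (10**3, 10**4, 10**5):
        # The number of switches grows like log2(T), not T.
        print(T, run_with_doubling_epochs(T), "~", math.floor(math.log2(T)) + 1)
```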