Near-Optimal Adversarial Reinforcement Learning with Switching Costs
- URL: http://arxiv.org/abs/2302.04374v1
- Date: Wed, 8 Feb 2023 23:41:29 GMT
- Title: Near-Optimal Adversarial Reinforcement Learning with Switching Costs
- Authors: Ming Shi, Yingbin Liang, Ness Shroff
- Abstract summary: We show how to develop a provably efficient algorithm for adversarial RL with switching costs.
Our lower bound indicates that, due to the fundamental challenge of switching costs in adversarial RL, the best previously achieved $\tilde{O}(\sqrt{T})$ regret (from static RL with switching costs and from adversarial RL without switching costs) is no longer achievable.
We propose two novel switching-reduced algorithms with regrets that match our lower bound when the transition function is known.
- Score: 43.895798638743784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Switching costs, which capture the costs for changing policies, are regarded
as a critical metric in reinforcement learning (RL), in addition to the
standard metric of losses (or rewards). However, existing studies on switching
costs (with a coefficient $\beta$ that is strictly positive and is independent
of $T$) have mainly focused on static RL, where the loss distribution is
assumed to be fixed during the learning process, and thus practical scenarios
where the loss distribution could be non-stationary or even adversarial are not
considered. While adversarial RL better models this type of practical
scenarios, an open problem remains: how to develop a provably efficient
algorithm for adversarial RL with switching costs? This paper makes the first
effort towards solving this problem. First, we provide a regret lower-bound
that shows that the regret of any algorithm must be larger than
$\tilde{\Omega}( ( H S A )^{1/3} T^{2/3} )$, where $T$, $S$, $A$ and $H$ are
the number of episodes, states, actions and layers in each episode,
respectively. Our lower bound indicates that, due to the fundamental challenge
of switching costs in adversarial RL, the best achieved regret (whose
dependency on $T$ is $\tilde{O}(\sqrt{T})$) in static RL with switching costs
(as well as adversarial RL without switching costs) is no longer achievable.
Moreover, we propose two novel switching-reduced algorithms with regrets that
match our lower bound when the transition function is known, and match our
lower bound within a small factor of $\tilde{O}( H^{1/3} )$ when the transition
function is unknown. Our regret analysis demonstrates the near-optimal
performance of both algorithms.
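The gap highlighted by the lower bound can be made concrete numerically. The following is a minimal sketch, not the paper's analysis: it assumes a generic $\sqrt{HSAT}$-type rate as a stand-in for the $\tilde{O}(\sqrt{T})$ regret achievable without switching costs (the exact polynomial dependence on $H$, $S$, $A$ varies across algorithms) and ignores constants and polylogarithmic factors.

```python
# Illustrative comparison of the two regret scalings discussed in the abstract.
# Constants and polylog factors are ignored; the sqrt(H*S*A*T) form is an assumed
# stand-in for a sqrt(T)-type bound, not a result taken from the paper.

def sqrt_t_rate(H: int, S: int, A: int, T: int) -> float:
    """~ sqrt(H*S*A*T): a T^{1/2}-type rate, achievable without switching costs."""
    return (H * S * A * T) ** 0.5

def switching_lower_bound(H: int, S: int, A: int, T: int) -> float:
    """~ (H*S*A)^{1/3} * T^{2/3}: the lower-bound scaling once switching costs are present."""
    return (H * S * A) ** (1 / 3) * T ** (2 / 3)

if __name__ == "__main__":
    H, S, A = 10, 20, 5  # hypothetical problem sizes
    for T in (10**3, 10**5, 10**7):
        print(f"T={T:>8}:  sqrt-rate ~ {sqrt_t_rate(H, S, A, T):>12,.0f}"
              f"   (HSA)^(1/3) T^(2/3) lower bound ~ {switching_lower_bound(H, S, A, T):>12,.0f}")
```

As $T$ grows, the $T^{2/3}$ term dominates any $\sqrt{T}$-type rate, which is the sense in which switching costs make the previously best regret unattainable.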
Related papers
- The Limits of Transfer Reinforcement Learning with Latent Low-rank Structure [9.631640936820126]
Many reinforcement learning algorithms are too costly to use in practice due to the large sizes $S, A$ of the problem's state and action spaces.
We consider the problem of transferring a latent low-rank representation when the source and target MDPs have transition kernels with latent low-rank structure.
Our algorithm learns latent representations in each source MDP and then exploits the linear structure to remove the dependence on $S, A $, or $S A$ in the target MDP regret bound.
arXiv Detail & Related papers (2024-10-28T23:12:08Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Bridging Distributional and Risk-sensitive Reinforcement Learning with
Provable Regret Bounds [24.571530193140916]
We consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return.
We propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one.
We prove that they both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta| H}\sqrt{S^2 A K})$ regret upper bound, where $S$, $A$, $K$, and $H$ represent the number of states, actions, episodes, and the horizon length, respectively.
arXiv Detail & Related papers (2022-10-25T14:30:48Z) - The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret
and Policy Switches [84.54669549718075]
We study the problem of regret minimization for episodic Reinforcement Learning (RL)
We focus on learning with general function classes and general model classes.
We show that a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost.
arXiv Detail & Related papers (2022-03-03T02:55:55Z) - Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost [31.04961854943877]
We propose a new algorithm that achieves a regret of $\widetilde{O}(\sqrt{H^4 S^2 A T})$ while requiring a switching cost of $O(HSA \log\log T)$.
As a byproduct, our new algorithmic techniques allow us to derive a reward-free exploration algorithm with an optimal switching cost of $O(HSA)$; a simplified sketch of the low-switching idea appears after this list.
arXiv Detail & Related papers (2022-02-13T19:01:06Z) - Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free
Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning.
The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences.
The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z) - The best of both worlds: stochastic and adversarial episodic MDPs with
unknown transition [49.78053380710322]
We consider the best-of-both-worlds problem for learning an episodic Markov Decision Process through $T$ episodes.
Recent work by [Jin and Luo, 2020] achieves this goal when the fixed transition is known.
In this work, we resolve this open problem by using the same Follow-the-Regularized-Leader (FTRL) framework together with a set of new techniques.
arXiv Detail & Related papers (2021-06-08T05:46:35Z) - Revisiting Smoothed Online Learning [70.09792747315323]
We investigate the problem of smoothed online learning, in which the online learner suffers both a hitting cost and a switching cost.
To bound the competitive ratio, we assume the hitting cost is known to the learner in each round, and investigate the greedy algorithm which simply minimizes the weighted sum of the hitting cost and the switching cost.
arXiv Detail & Related papers (2021-02-13T14:15:55Z) - Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in
Multi-Agent RL and Inventory Control [28.80743320843154]
We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL.
We show that our algorithms are nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
arXiv Detail & Related papers (2020-10-07T04:55:56Z)
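Several of the entries above revolve around keeping the number of policy switches small while running many episodes. Below is a minimal sketch, under the simplifying assumption of a plain doubling-epoch schedule (which yields $O(\log T)$ switches; the $O(\log\log T)$ and reward-free schedules in the papers above are more refined), with a placeholder standing in for the actual policy update:

```python
import math

def run_with_doubling_epochs(T: int) -> int:
    """Run T episodes but recompute the policy only when the episode index
    reaches a power of two, giving O(log T) policy switches instead of O(T).
    The policy-update step is a hypothetical placeholder."""
    policy = None
    switches = 0
    next_update = 1
    for episode in range(1, T + 1):
        if episode == next_update:
            policy = f"policy_updated_at_episode_{episode}"  # placeholder update
            switches += 1
            next_update *= 2  # doubling schedule: next recomputation twice as far out
        # ... play the episode under `policy` and record the observed losses ...
    return switches

if __name__ == "__main__":
    for T in (10**3, 10**4, 10**5):
        # The number of switches grows like log2(T), not T.
        print(T, run_with_doubling_epochs(T), "~", math.floor(math.log2(T)) + 1)
```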